Data analysis workflow using data inputs from AndroSensor (ver 1.9.6.3), captured at a 20 Hz (0.05 s) interval. Unlike other sensor logger apps, most of the readings from AndroSensor have already been partially processed, so minimal processing is required here. Smartphone models used:
Data was collected on a bicycle on various roads and Park Connector Networks (PCNs) in Singapore. PCNs are paved with asphalt like roads, making them analogous to roads for testing and also safer for the experimenter due to the lack of motorised traffic. Ideally, the sensor (phone) should be placed as close as possible to the centre of mass of the vehicle, and extra care should be taken to measure the exact location of each pothole to make verification easier. It is also recommended to use multiple duplicate sensor loggers (phones), both to allow generalisation across phone models and individual phones and to serve as redundant backups.
Key Target Variables
Quaternion rotation of the linear acceleration and gyroscope values to world coordinates using the rotation vector is important: it allows cross-comparison of sensor data across different orientations, and it avoids gimbal lock, which would cause inconsistent results.
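AndroSensor already supplies world-frame values in this workflow, but for loggers that report raw device-frame readings, the rotation can be sketched with SciPy. This is a minimal illustration (an assumption, not the app's actual pipeline), taking the rotation-vector quaternion in SciPy's (x, y, z, w) order:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rotate_to_world(vec_device, quat_xyzw):
    # Apply the orientation quaternion to bring a device-frame
    # vector into world coordinates (SciPy expects (x, y, z, w) order)
    return Rotation.from_quat(quat_xyzw).apply(vec_device)

# Sanity check: a 90-degree rotation about z maps the device
# x-axis onto the world y-axis
q = Rotation.from_euler('z', 90, degrees=True).as_quat()
world = rotate_to_world(np.array([1.0, 0.0, 0.0]), q)
```

Because a single quaternion encodes the full 3D orientation, this path never hits the gimbal-lock singularities that Euler-angle pipelines suffer from.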
Initial Findings:
Change Log:
Version 2 (Analysis_02):
Version 3 (Analysis_03):
Version 4 (Analysis_04):
Version 5 (Analysis_05):
Key Findings
Based on our analysis here, we achieved moderate success in detecting possible potholes with unsupervised learning methods, through a combination of:
Balancing sensitivity and specificity in discriminating between pothole and non-pothole events is a key concern, since many non-pothole events may produce behaviour similar to potholes.
The suspected potholes have been mapped against known potholes via Google Maps:
https://www.google.com/maps/d/edit?mid=1QEkC2QL8fVG3Dh1wiQY0WLXG1IlYsgti
This could also be done via Excel's 3D Map function, especially if continuous points are to be plotted, most notably the route sections where measurements were taken. This is important since non-pothole regions lack a geotagged image of the zone.
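If the coordinates live in the dataframe, a CSV that Google My Maps (or Excel 3D Maps) can import is one `to_csv` away. A sketch with a toy frame; the column names here are hypothetical and should be matched to the actual AndroSensor export:

```python
import pandas as pd

# Hypothetical columns -- match these to the actual AndroSensor export
coords = pd.DataFrame({
    "LOCATION Latitude": [1.3521, 1.3550],
    "LOCATION Longitude": [103.8198, 103.8200],
    "Suspected": [True, False],
})

# Keep only flagged rows; the resulting CSV imports directly as a map layer
suspects = coords.loc[coords["Suspected"],
                      ["LOCATION Latitude", "LOCATION Longitude"]]
suspects.to_csv("suspected_potholes.csv", index=False)
```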
References
# Install libraries
!pip install pandas --upgrade
!pip install pyxlsb
!pip install PyWavelets
!pip install hdbscan
# Load libraries
import pandas as pd
import numpy as np
import pywt
from sklearn.decomposition import PCA
from sklearn import preprocessing
import seaborn as sns
import matplotlib.pyplot as plt
import time
from sklearn.neighbors import LocalOutlierFactor
import hdbscan
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from geopy import Point
from geopy import distance
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
# Define root file directory folder where the files are being stored
# set the root directory as the string following "/content/" where you mounted the google drive
# For the top most folder "gdrive", change it to "drive"
%cd /content/drive/My Drive/Colab Notebooks/LTA_MobSense/roadAnomaly_detect/Module_Import/
# Check directory location
!pwd
# Inspect files in directory
!ls
The values we are using from AndroSensor have already been rotated to world coordinates, so we don't need to perform quaternion rotation here.
import iirFilter_butterworth as irrFilter
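The filter module's source isn't reproduced in this notebook. As a rough sketch (an assumption about the helper, not its actual code), a zero-phase Butterworth band-pass like `butter_bandpass_zerophase` can be built from SciPy's `butter` and `filtfilt`, where the forward-backward pass cancels the filter's phase lag:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def butter_bandpass_zerophase(data, lowcut, highcut, fs, order=3):
    # Design a Butterworth band-pass, then run it forward and backward
    # (filtfilt) so the output has zero phase lag
    nyq = 0.5 * fs
    b, a = butter(order, [lowcut / nyq, highcut / nyq], btype='band')
    return filtfilt(b, a, data)

# Toy check: a 4 Hz tone inside a 2.5-5 Hz band survives,
# while a 0.5 Hz tone outside it is attenuated
fs = 20
t = np.arange(0, 10, 1 / fs)
sig = np.sin(2 * np.pi * 4 * t) + np.sin(2 * np.pi * 0.5 * t)
out = butter_bandpass_zerophase(sig, 2.5, 5, fs, order=3)
```

Zero-phase filtering matters here because phase lag would shift detected events relative to their GPS timestamps.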
This dataset contains wavelets from multiple road potholes collected in a survey, spliced together with buffer data added as spacers in between. The data was collected on a bicycle with front-tyre suspension. A mixture of roads and park connectors was chosen for this exercise. Note that the pothole data taken from roads tends to be noisier, due to the safety constraints of finding sufficient space to accelerate and cruise through a pothole on a busy road. DO NOT TRY THIS AT HOME!
Overview of data loaded:
# File info
#xlsfile = "../MainData/pothole_records_20200316_bike/AndroSensor_S7/combinedWaveletData_v3b.xlsb" # alternative file location
xlsfile = "../MainData/pothole_records_20200316_bike/AndroSensor_Consolidated/combinedWaveletData_v3b.xlsb" # file location
ws = "Data" # name of worksheet
# Load dataframe
df = pd.read_excel(io = xlsfile, sheet_name = ws, engine = 'pyxlsb')
# Inspect Data
df.head()
Not required, as this version's dataset has been indexed beforehand.
# Rename first column as a pseudo index col
#df.rename(columns={"Unnamed: 0":'SN'}, inplace=True)
# Inspect Data
df.head()
# Drop redundant acceleration values
df = df.drop(['ACCELEROMETER X (m/s²)', 'ACCELEROMETER Y (m/s²)', 'ACCELEROMETER Z (m/s²)', 'Satellites in range', 'LOCATION Altitude-google ( m)', 'LOCATION Altitude ( m)'], axis=1)
# Inspect Data
df.head()
Tagging is not required, as it has been done beforehand.
Tags:
# Define Event Conditions
#cond_buffer = (df["Zone"] == 0) # event code 0
#cond_ss = (df["Zone"] == 18) # event code 2
#cond_cycling = (df["Zone"] == 19) & (df["Run"] == 1) # event code 3
#cond_hump = (df["Zone"] == 19) & (df["Run"] == 2) # event code 4
# Tag events
# Steps must be done sequentially due to the conditions set for potholes, where a range cannot be passed
#df["Event"] = np.select([cond_ss, cond_cycling, cond_hump], [2, 3, 4], default = 1) # defaults to pothole; event type 1
#df.loc[cond_buffer, "Event"] = 0 # retag buffer zones
# inspect results
df.head()
# Check results
# Event 0 - Buffer
#df.loc[(df['Event'] == 0)][0:3]
# Event 1 - Pothole
df.loc[(df['Event'] == 1)][0:3]
# Event 2 - Speed Stripes
#df.loc[(df['Event'] == 2)][0:3]
# Event 3 - Normal Cycling
#df.loc[(df['Event'] == 3)][0:3]
# Event 4 - Hump
#df.loc[(df['Event'] == 4)][0:3]
# Event 9 - Uneven surfaces
#df.loc[(df['Event'] == 9)][0:3]
# Check event distribution in percentage
print("Event Type Distribution (%)")
print("Total number of observations: " + str(len(df["SN"])))
# Event 0
eventCode = 0
distribution = round(len(df.loc[(df['Event'] == eventCode)]) / len(df["SN"]) * 100, 3)
print("Event " + str(eventCode) + ": " + str(distribution) + "%")
# Event 1
eventCode = 1
distribution = round(len(df.loc[(df['Event'] == eventCode)]) / len(df["SN"]) * 100, 3)
print("Event " + str(eventCode) + ": " + str(distribution) + "%")
# Event 2
eventCode = 2
distribution = round(len(df.loc[(df['Event'] == eventCode)]) / len(df["SN"]) * 100, 3)
print("Event " + str(eventCode) + ": " + str(distribution) + "%")
# Event 3
eventCode = 3
distribution = round(len(df.loc[(df['Event'] == eventCode)]) / len(df["SN"]) * 100, 3)
print("Event " + str(eventCode) + ": " + str(distribution) + "%")
# Event 4
eventCode = 4
distribution = round(len(df.loc[(df['Event'] == eventCode)]) / len(df["SN"]) * 100, 3)
print("Event " + str(eventCode) + ": " + str(distribution) + "%")
# Event 9
eventCode = 9
distribution = round(len(df.loc[(df['Event'] == eventCode)]) / len(df["SN"]) * 100, 3)
print("Event " + str(eventCode) + ": " + str(distribution) + "%")
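The six near-identical cells above can be collapsed into a loop over the event codes. A sketch using a toy frame with the notebook's column names:

```python
import pandas as pd

# Toy frame standing in for the notebook's df (same column names)
toy = pd.DataFrame({"SN": range(10),
                    "Event": [0, 0, 1, 1, 1, 2, 3, 4, 9, 9]})

print("Event Type Distribution (%)")
print("Total number of observations: " + str(len(toy["SN"])))
dist = {}
for event_code in [0, 1, 2, 3, 4, 9]:
    # fraction of rows with this event code, as a percentage
    pct = round((toy["Event"] == event_code).mean() * 100, 3)
    dist[event_code] = pct
    print("Event " + str(event_code) + ": " + str(pct) + "%")
```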
# Plot key raw signal data
import matplotlib.pyplot as plt
df.plot(x='SN', y='LINEAR ACCELERATION Z (m/s²)', figsize=(30, 10), color='darkred')
df.plot(x='SN', y='GYROSCOPE X (rad/s)', figsize=(30, 10))
plt.show()
Filtering the data to retain key signals only
In another notebook, we applied the Fast Fourier Transform (FFT) to identify the major signals for each event; the results are summarised below. We will primarily look at the linear acceleration on the Z axis and the pitch rotation on the X axis, as these variables best correlate with the actual motion observed when road bumps are encountered. The linear acceleration and gyroscope rotation values on the other axes will be pegged to these settings.
# Peak Freq Regions
# Coasting
# Linear Acceleration (Z axis): 1.25Hz to 1.75Hz | power: 1500 to 3000
# Pitch Rotation (X Axis): <3Hz (abnormal spike around < 0.005Hz) | power: 65 to 120
# Hump
# Linear Acceleration (Z axis): 0.75Hz to 1.75Hz | power: 200 to 490
# Pitch Rotation (X Axis): <0.5Hz, 1.75Hz, 2.9Hz | power: 15 to 38
# Speed Stripes
# Linear Acceleration (Z axis): 0.8Hz to 1.6Hz | power: 400 to 1100
# Pitch Rotation (X Axis): <0.75Hz, 1.9Hz | power: 30 to 53
# Potholes
# Linear Acceleration (Z axis): <5Hz | power: 1500 to 2500
# Pitch Rotation (X Axis): 1.4Hz to 4Hz | power: 80 to 165
# This means that in our filter settings, these are the estimated signal bands to filter in
# Do note that the figures listed here are merely starting points to consider.
# It is quite likely that the faster the speed of the object, the higher the frequency observed
# Linear Acceleration (Z axis): 1.8Hz to 5Hz
# Pitch Rotation (X Axis): 3Hz to 4Hz
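As an illustration of how such peak bands are found (on a synthetic signal, not the survey data), the dominant frequency of a segment can be read straight off the FFT power spectrum:

```python
import numpy as np

fs = 20                        # sampling rate in Hz, as in this notebook
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(0)
# Synthetic stand-in for an event window: a 3 Hz oscillation plus noise
sig = np.sin(2 * np.pi * 3 * t) + 0.2 * rng.standard_normal(t.size)

power = np.abs(np.fft.rfft(sig)) ** 2        # power spectrum
freqs = np.fft.rfftfreq(sig.size, d=1 / fs)  # matching frequency axis
peak_freq = freqs[np.argmax(power[1:]) + 1]  # skip the DC bin
```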
# Sample rate(in Hz).
fs = 20
# Desired cutoff frequencies
# These settings are primarily tagged to the Z-Axis Linear Acceleration
lowcut_acc1 = 2.5
highcut_acc1 = 5
#lowcut_acc2 = 3
#highcut_acc2 = 5
# These settings are primarily tagged to the X-Axis Gyroscope Rotation
lowcut_gyro1 = 3
highcut_gyro1 = 5
#lowcut_gyro2 = 3.5
#highcut_gyro2 = 5
# Generate filtered results
# Different levels of filtering applied
df['LINEAR ACCELERATION X_f1 (m/s²)'] = irrFilter.butter_bandpass_zerophase(df['LINEAR ACCELERATION X (m/s²)'], lowcut_acc1, highcut_acc1, fs, order=3)
#df['LINEAR ACCELERATION X_f2 (m/s²)'] = irrFilter.butter_bandpass_zerophase(df['LINEAR ACCELERATION X (m/s²)'], lowcut_acc2, highcut_acc2, fs, order=3)
df['LINEAR ACCELERATION Y_f1 (m/s²)'] = irrFilter.butter_bandpass_zerophase(df['LINEAR ACCELERATION Y (m/s²)'], lowcut_acc1, highcut_acc1, fs, order=3)
#df['LINEAR ACCELERATION Y_f2 (m/s²)'] = irrFilter.butter_bandpass_zerophase(df['LINEAR ACCELERATION Y (m/s²)'], lowcut_acc2, highcut_acc2, fs, order=3)
df['LINEAR ACCELERATION Z_f1 (m/s²)'] = irrFilter.butter_bandpass_zerophase(df['LINEAR ACCELERATION Z (m/s²)'], lowcut_acc1, highcut_acc1, fs, order=3)
#df['LINEAR ACCELERATION Z_f2 (m/s²)'] = irrFilter.butter_bandpass_zerophase(df['LINEAR ACCELERATION Z (m/s²)'], lowcut_acc2, highcut_acc2, fs, order=3)
# Inspect filtered results
df.head()
# Plot results
t = df['SN']
plt.figure()
plt.clf()
plt.figure(figsize=(30, 18))
plt.title("Linear Acceleration (X-Axis)")
plt.plot(t, df['LINEAR ACCELERATION X (m/s²)'], label='Noisy signal')
plt.plot(t, df['LINEAR ACCELERATION X_f1 (m/s²)']+100, label='Filtered signal Zero-Phase Band 1 (+100)')
#plt.plot(t, df['LINEAR ACCELERATION X_f2 (m/s²)']+150, label='Filtered signal Zero-Phase Band 2 (+150)')
plt.xlabel('Index (20 Hz interval)')
#plt.hlines([-a, a], 0, T, linestyles='--')
plt.grid(True)
plt.axis('tight')
plt.legend(loc='lower right')
plt.show()
# Plot results
t = df['SN']
plt.figure()
plt.clf()
plt.figure(figsize=(30, 18))
plt.title("Linear Acceleration (Y-Axis)")
plt.plot(t, df['LINEAR ACCELERATION Y (m/s²)'], label='Noisy signal')
plt.plot(t, df['LINEAR ACCELERATION Y_f1 (m/s²)']+100, label='Filtered signal Zero-Phase Band 1 (+100)')
#plt.plot(t, df['LINEAR ACCELERATION Y_f2 (m/s²)']+150, label='Filtered signal Zero-Phase Band 2 (+150)')
plt.xlabel('Index (20 Hz interval)')
#plt.hlines([-a, a], 0, T, linestyles='--')
plt.grid(True)
plt.axis('tight')
plt.legend(loc='lower right')
plt.show()
# Plot results
t = df['SN']
plt.figure()
plt.clf()
plt.figure(figsize=(30, 18))
plt.title("Linear Acceleration (Z-Axis)")
plt.plot(t, df['LINEAR ACCELERATION Z (m/s²)'], label='Noisy signal')
plt.plot(t, df['LINEAR ACCELERATION Z_f1 (m/s²)']+100, label='Filtered signal Zero-Phase Band 1 (+100)')
#plt.plot(t, df['LINEAR ACCELERATION Z_f2 (m/s²)']+150, label='Filtered signal Zero-Phase Band 2 (+150)')
plt.xlabel('Index (20 Hz interval)')
#plt.hlines([-a, a], 0, T, linestyles='--')
plt.grid(True)
plt.axis('tight')
plt.legend(loc='lower right')
plt.show()
# Generate filtered results
# Different levels of filtering applied
df['GYROSCOPE X_f1 (rad/s)'] = irrFilter.butter_bandpass_zerophase(df['GYROSCOPE X (rad/s)'], lowcut_gyro1, highcut_gyro1, fs, order=3)
#df['GYROSCOPE X_f2 (rad/s)'] = irrFilter.butter_bandpass_zerophase(df['GYROSCOPE X (rad/s)'], lowcut_gyro2, highcut_gyro2, fs, order=3)
df['GYROSCOPE Y_f1 (rad/s)'] = irrFilter.butter_bandpass_zerophase(df['GYROSCOPE Y (rad/s)'], lowcut_gyro1, highcut_gyro1, fs, order=3)
#df['GYROSCOPE Y_f2 (rad/s)'] = irrFilter.butter_bandpass_zerophase(df['GYROSCOPE Y (rad/s)'], lowcut_gyro2, highcut_gyro2, fs, order=3)
df['GYROSCOPE Z_f1 (rad/s)'] = irrFilter.butter_bandpass_zerophase(df['GYROSCOPE Z (rad/s)'], lowcut_gyro1, highcut_gyro1, fs, order=3)
#df['GYROSCOPE Z_f2 (rad/s)'] = irrFilter.butter_bandpass_zerophase(df['GYROSCOPE Z (rad/s)'], lowcut_gyro2, highcut_gyro2, fs, order=3)
# Inspect filtered results
df.head()
# Plot results
t = df['SN']
plt.figure()
plt.clf()
plt.figure(figsize=(30, 15))
plt.title("Gyroscope Rotation (X-Axis)")
plt.plot(t, df['GYROSCOPE X (rad/s)'], label='Noisy signal')
plt.plot(t, df['GYROSCOPE X_f1 (rad/s)']+10, label='Filtered signal Zero-Phase Band 1 (+10)')
#plt.plot(t, df['GYROSCOPE X_f2 (rad/s)']+15, label='Filtered signal Zero-Phase Band 2 (+15)')
plt.xlabel('Index (20 Hz interval)')
#plt.hlines([-a, a], 0, T, linestyles='--')
plt.grid(True)
plt.axis('tight')
plt.legend(loc='lower right')
plt.show()
# Plot results
t = df['SN']
plt.figure()
plt.clf()
plt.figure(figsize=(30, 15))
plt.title("Gyroscope Rotation (Y-Axis)")
plt.plot(t, df['GYROSCOPE Y (rad/s)'], label='Noisy signal')
plt.plot(t, df['GYROSCOPE Y_f1 (rad/s)']+10, label='Filtered signal Zero-Phase Band 1 (+10)')
#plt.plot(t, df['GYROSCOPE Y_f2 (rad/s)']+15, label='Filtered signal Zero-Phase Band 2 (+15)')
plt.xlabel('Index (20 Hz interval)')
#plt.hlines([-a, a], 0, T, linestyles='--')
plt.grid(True)
plt.axis('tight')
plt.legend(loc='lower right')
plt.show()
# Plot results
t = df['SN']
plt.figure()
plt.clf()
plt.figure(figsize=(30, 15))
plt.title("Gyroscope Rotation (Z-Axis)")
plt.plot(t, df['GYROSCOPE Z (rad/s)'], label='Noisy signal')
plt.plot(t, df['GYROSCOPE Z_f1 (rad/s)']+10, label='Filtered signal Zero-Phase Band 1 (+10)')
#plt.plot(t, df['GYROSCOPE Z_f2 (rad/s)']+15, label='Filtered signal Zero-Phase Band 2 (+15)')
plt.xlabel('Index (20 Hz interval)')
#plt.hlines([-a, a], 0, T, linestyles='--')
plt.grid(True)
plt.axis('tight')
plt.legend(loc='lower right')
plt.show()
Feature extraction in the frequency, time and wavelet domains
To choose the mother wavelet, pick one whose energy, entropy and shape are similar to those of the target signal. For the most part, this involves trial and error to find a wavelet family that matches the target signal's profile.
In our case, previous studies found that db3 wavelets mimic the signal of bumps and potholes well, so as a first cut we'll look for the continuous wavelet most similar to db3.
# Define candidate mother wavelets that best mimic the target signal
# These mother wavelets best mimic the db3 wavelet, which is similar to our target signal.
wavelet1 = "cgau4"
wavelet2 = "cgau8"
# cgau4 seems to give stronger amplitude signals than gau1 and cgau1
# cgau4 seems to give stronger amplitude signals than gaus4
# cgau8 seems to give stronger amplitude signals than gaus8
# Determine sampling scale
# f = scale2frequency(wavelet, scale)/sampling_period
# The sampling period (dt) is the reciprocal of the sampling frequency (fs, in Hz)
dt = 1 / fs
# Define scale to test
# The scale roughly translates to how much the wavelet is stretched to fit the time window
# A scale of 1 means the wavelet window spans a single sampling interval (1/fs s); this is unusable because of the Nyquist frequency requirement
# A scale of n means the wavelet window spans n sampling intervals
WaveletDuration = 2 # estimated duration of the wavelet in seconds
upperRange = WaveletDuration * fs
scale = np.arange(1, upperRange)
scale
# Guesstimate the wavelet scale based on the wavelet frequency output
# The wavelet frequency must satisfy 1 < fw < (0.5 * fs)
# Based on our sampling frequency of 20 Hz, we can only use scales whose wavelet frequencies are below 10 Hz
# Therefore we should try scales of 2 or more.
# Wavelet Frequency 1
fw1 = pywt.scale2frequency(wavelet1, scale) / dt
# Inspect results
np.round(fw1, decimals = 3)
# Wavelet Frequency 2
fw2 = pywt.scale2frequency(wavelet2, scale) / dt
# Inspect results
np.round(fw2, decimals = 3)
The following general principles are important to keep in mind when interpreting CWT coefficients.
Cone of influence — Depending on the scale, the CWT coefficient at a point can be affected by signal values at points far removed. You have to take into account the support of the wavelet at specific scales. Use a cone-of-influence computation (e.g. MATLAB's conofinf) to determine it. Not all wavelets are equal in their support. For example, the Haar wavelet has smaller support at all scales than the sym4 wavelet.
Detecting abrupt transitions — Wavelets are very useful for detecting abrupt changes in a signal. Abrupt changes in a signal produce relatively large wavelet coefficients (in absolute value) centered around the discontinuity at all scales. Because of the support of the wavelet, the set of CWT coefficients affected by the singularity increases with increasing scale. Recall this is the definition of the cone of influence. The most precise localization of the discontinuity based on the CWT coefficients is obtained at the smallest scales.
Detecting smooth signal features — Smooth signal features produce relatively large wavelet coefficients at scales where the oscillation in the wavelet correlates best with the signal feature. For sinusoidal oscillations, the CWT coefficients display an oscillatory pattern at scales where the oscillation in the wavelet approximates the period of the sine wave.
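The abrupt-transition principle above can be verified on a synthetic pulse: the largest small-scale coefficients sit at the edges, and the band of large coefficients (the cone of influence) widens as the scale grows. A minimal sketch with a real Gaussian-derivative wavelet:

```python
import numpy as np
import pywt

# A rectangular pulse: abrupt edges at indices 90 and 110
sig = np.zeros(200)
sig[90:110] = 1.0

scales = np.arange(1, 31)
coef, _ = pywt.cwt(sig, scales, 'gaus1')

# The smallest scale localizes the discontinuities most precisely
loc = int(np.argmax(np.abs(coef[0])))

# Cone of influence: the region of large coefficients widens with scale
w_small = int(np.sum(np.abs(coef[0]) > 0.5 * np.abs(coef[0]).max()))
w_large = int(np.sum(np.abs(coef[-1]) > 0.5 * np.abs(coef[-1]).max()))
```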
#pywt.cwt(data, scales, wavelet)
# each row corresponds to one sampled scale
# each column of the coefficients output corresponds to one observation in our data
cwt_coef, cwt_freqs = pywt.cwt(df['LINEAR ACCELERATION Z_f1 (m/s²)'], scale, wavelet1, sampling_period = dt)
# Check dimension of arrays created
print(cwt_coef.shape)
print(cwt_freqs.shape)
# Inspect frequency data sample
cwt_freqs[0:100]
# Inspect coefficient data sample
cwt_coef[::,0:3]
print(scale.min())
print(scale.max())
print(cwt_coef.shape[1])
# Plot Graph (2D) - original
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint at potential potholes or bumps.
# Identify the affected observation numbers, and find the GPS coordinates of those regions.
#plt.figure(figsize=(30,10))
#plt.imshow(cwt_coef, cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), scale.max(), scale.min()-0.5])
#cbar = plt.colorbar()
#cbar.set_label(label='Amplitude', size=18)
#cbar.ax.tick_params(labelsize=13)
#plt.title("CWT (1st Order Gaussian Wavelet Applied) on Linear Acceleration (Z-Axis)", fontsize = 20)
#plt.xlabel("Index (20Hz Interval)", fontsize = 18)
#plt.ylabel("Adjusted Scale", fontsize = 18)
#plt.tick_params(axis='both', which='major', labelsize = 13)
#plt.show()
# Plot Graph (2D) - Log Scale Applied to Amplitude
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint at potential potholes or bumps.
# Identify the affected observation numbers, and find the GPS coordinates of those regions.
# Since the amplitude has a wide range around the average, we plot the log of the absolute magnitude instead.
plt.figure(figsize=(35,10))
plt.imshow(np.log(abs(cwt_coef)), cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), scale.max(), scale.min()-0.5])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (4th Order Complex Gaussian Wavelet Applied) on Linear Acceleration (Z-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Plot Graph (2D) - Log Scale Applied to Amplitude (Zoomed In version)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint at potential potholes or bumps.
# Identify the affected observation numbers, and find the GPS coordinates of those regions.
plt.figure(figsize=(35,10))
plt.imshow(np.log(abs(cwt_coef))[0:20,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), 20, scale.min()])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (4th Order Complex Gaussian Wavelet Applied) on Linear Acceleration (Z-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Threshold suppression of background noise (baseline)
test0 = np.log(abs(cwt_coef))
test0[test0 < 0] = -10
# Plot Graph (2D) - Log Scale Applied to Amplitude
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint at potential potholes or bumps.
# Identify the affected observation numbers, and find the GPS coordinates of those regions.
# Since the amplitude has a wide range around the average, we plot the log of the absolute magnitude instead.
plt.figure(figsize=(35,10))
plt.imshow(test0[0:10,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), 10, scale.min()])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (4th Order Complex Gaussian Wavelet Applied) on Linear Acceleration (Z-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Threshold suppression of background noise
test1 = np.log(abs(cwt_coef))
test1[test1 < 0.4] = -10
# Plot Graph (2D) - Log Scale Applied to Amplitude
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint at potential potholes or bumps.
# Identify the affected observation numbers, and find the GPS coordinates of those regions.
# Since the amplitude has a wide range around the average, we plot the log of the absolute magnitude instead.
plt.figure(figsize=(35,10))
plt.imshow(test1[0:10,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), 10, scale.min()])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (4th Order Complex Gaussian Wavelet Applied) on Linear Acceleration (Z-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
Potential candidate *
#pywt.cwt(data, scales, wavelet)
# each row corresponds to one sampled scale
# each column of the coefficients output corresponds to one observation in our data
cwt_coef, cwt_freqs = pywt.cwt(df['LINEAR ACCELERATION Z_f1 (m/s²)'], scale, wavelet2, sampling_period = dt)
# Check dimension of arrays created
print(cwt_coef.shape)
print(cwt_freqs.shape)
# Inspect frequency data sample
cwt_freqs[0:100]
# Inspect coefficient data sample
cwt_coef[::,0:3]
print(scale.min())
print(scale.max())
print(cwt_coef.shape[1])
# Plot Graph (2D) - Log Scale Applied to Amplitude
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint at potential potholes or bumps.
# Identify the affected observation numbers, and find the GPS coordinates of those regions.
# Since the amplitude has a wide range around the average, we plot the log of the absolute magnitude instead.
plt.figure(figsize=(35,10))
plt.imshow(np.log(abs(cwt_coef)), cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), scale.max(), scale.min()-0.5])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (8th Order Complex Gaussian Wavelet Applied) on Linear Acceleration (Z-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Plot Graph (2D) - Log Scale Applied to Amplitude (Zoomed In version)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint at potential potholes or bumps.
# Identify the affected observation numbers, and find the GPS coordinates of those regions.
plt.figure(figsize=(35,10))
plt.imshow(np.log(abs(cwt_coef))[0:20,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), 20, scale.min()])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (8th Order Complex Gaussian Wavelet Applied) on Linear Acceleration (Z-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Threshold suppression of background noise (baseline)
test0 = np.log(abs(cwt_coef))
test0[test0 < 0] = -10
# Plot Graph (2D) - Log Scale Applied to Amplitude (Zoomed In version)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint at potential potholes or bumps.
# Identify the affected observation numbers, and find the GPS coordinates of those regions.
plt.figure(figsize=(35,10))
plt.imshow(test0[0:10,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), 10, scale.min()])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (8th Order Complex Gaussian Wavelet Applied) on Linear Acceleration (Z-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Threshold suppression of background noise
test1 = np.log(abs(cwt_coef))
test1[test1 < 0.5] = -10
# Plot Graph (2D) - Log Scale Applied to Amplitude (Zoomed In version)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint at potential potholes or bumps.
# Identify the affected observation numbers, and find the GPS coordinates of those regions.
plt.figure(figsize=(35,10))
plt.imshow(test1[0:10,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), 10, scale.min()])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (8th Order Complex Gaussian Wavelet Applied) on Linear Acceleration (Z-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
#pywt.cwt(data, scales, wavelet)
# each row corresponds to one sampled scale
# each column of the coefficients output corresponds to one observation in our data
cwt_coef, cwt_freqs = pywt.cwt(df['GYROSCOPE X_f1 (rad/s)'], scale, wavelet1, sampling_period = dt)
# Check dimension of arrays created
print(cwt_coef.shape)
print(cwt_freqs.shape)
# Inspect frequency data sample
cwt_freqs[0:100]
# Inspect coefficient data sample
cwt_coef[::,0:3]
print(scale.min())
print(scale.max())
print(cwt_coef.shape[1])
# Plot Graph (2D) - original
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint at potential potholes or bumps.
# Identify the affected observation numbers, and find the GPS coordinates of those regions.
#plt.figure(figsize=(30,10))
#plt.imshow(cwt_coef, cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), scale.max(), scale.min()-0.5])
#cbar = plt.colorbar()
#cbar.set_label(label='Amplitude', size=18)
#cbar.ax.tick_params(labelsize=13)
#plt.title("CWT (1st Order Gaussian Wavelet Applied) on Gyroscope Rotation (X-Axis)", fontsize = 20)
#plt.xlabel("Index (20Hz Interval)", fontsize = 18)
#plt.ylabel("Adjusted Scale", fontsize = 18)
#plt.tick_params(axis='both', which='major', labelsize = 13)
#plt.show()
# Plot Graph (2D) - Log Scale Applied to Amplitude
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint at potential potholes or bumps.
# Identify the affected observation numbers, and find the GPS coordinates of those regions.
# Since the amplitude has a wide range around the average, we plot the log of the absolute magnitude instead.
plt.figure(figsize=(35,10))
plt.imshow(np.log(abs(cwt_coef)), cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), scale.max(), scale.min()-0.5])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (4th Order Complex Gaussian Wavelet Applied) on Gyroscope Rotation (X-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Plot Graph (2D) - Log Scale Applied to Amplitude (Zoomed In version)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint at potential potholes or bumps.
# Identify the affected observation numbers, and find the GPS coordinates of those regions.
plt.figure(figsize=(35,10))
plt.imshow(np.log(abs(cwt_coef))[0:10,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), 10, scale.min()])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (4th Order Complex Gaussian Wavelet Applied) on Gyroscope Rotation (X-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Threshold suppression of background noise (baseline)
test0 = np.log(abs(cwt_coef))
test0[test0 < -5] = -20
# Plot Graph (2D) - Log Scale Applied to Amplitude (Zoomed In version)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint at potential potholes or bumps.
# Identify the affected observation numbers, and find the GPS coordinates of those regions.
plt.figure(figsize=(35,10))
plt.imshow(test0[0:6,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), 6, scale.min()])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (4th Order Complex Gaussian Wavelet Applied) on Gyroscope Rotation (X-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Threshold suppression of background noise
test1 = np.log(abs(cwt_coef))
test1[test1 < -4] = -20
# Plot Graph (2D) - Log Scale Applied to Amplitude (Zoomed In version)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint of potential potholes or bumps
# Identify the observation # affected, and find the GPS coordinates of those regions.
plt.figure(figsize=(35,10))
plt.imshow(test1[0:6,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), 6, scale.min()])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (4th Order Complex Gaussian Wavelet Applied) on Gyroscope Rotation (X-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
Potential Candidate *
#pywt.cwt(data, scales, wavelet)
# each row of the coefficient output corresponds to one sampled scale
# each column corresponds to one observation in our data
cwt_coef, cwt_freqs = pywt.cwt(df['GYROSCOPE X_f1 (rad/s)'], scale, wavelet2, sampling_period = dt)
# Check dimension of arrays created
print(cwt_coef.shape)
print(cwt_freqs.shape)
# Inspect frequency data sample
cwt_freqs[0:100]
# Inspect coefficient data sample
cwt_coef[::,0:3]
print(scale.min())
print(scale.max())
print(cwt_coef.shape[1])
# Plot Graph (2D) - Log Scale Applied to Amplitude
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint of potential potholes or bumps
# Identify the observation # affected, and find the GPS coordinates of those regions.
# Since the amplitude spans quite a wide range about the average, we plot the log of the absolute value of the magnitude instead
plt.figure(figsize=(35,10))
plt.imshow(np.log(abs(cwt_coef)), cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), scale.max(), scale.min()-0.5])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (8th Order Complex Gaussian Wavelet Applied) on Gyroscope Rotation (X-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Plot Graph (2D) - Log Scale Applied to Amplitude (Zoomed In version)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint of potential potholes or bumps
# Identify the observation # affected, and find the GPS coordinates of those regions.
plt.figure(figsize=(35,10))
plt.imshow(np.log(abs(cwt_coef))[0:10,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), 10, scale.min()])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (8th Order Complex Gaussian Wavelet Applied) on Gyroscope Rotation (X-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Threshold suppression of background noise (baseline)
test0 = np.log(abs(cwt_coef))
test0[test0 < -5] = -20
# Plot Graph (2D) - Log Scale Applied to Amplitude (Zoomed In version)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint of potential potholes or bumps
# Identify the observation # affected, and find the GPS coordinates of those regions.
plt.figure(figsize=(35,10))
plt.imshow(test0[0:7,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), 7, scale.min()])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (8th Order Complex Gaussian Wavelet Applied) on Gyroscope Rotation (X-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Threshold suppression of background noise
test1 = np.log(abs(cwt_coef))
test1[test1 < -4] = -20
# Plot Graph (2D) - Log Scale Applied to Amplitude (Zoomed In version)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint of potential potholes or bumps
# Identify the observation # affected, and find the GPS coordinates of those regions.
plt.figure(figsize=(35,10))
plt.imshow(test1[0:7,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef.shape[1]-1), 7, scale.min()])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (8th Order Complex Gaussian Wavelet Applied) on Gyroscope Rotation (X-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
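The scalogram cells above repeat the same plotting boilerplate for each wavelet order, zoom level, and noise threshold. A small helper could collapse them into one call; this is a sketch only (the function name and defaults are my own, the non-interactive backend is chosen so it also runs in scripts, and the random complex array merely stands in for `pywt.cwt` coefficients):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

def plot_cwt_scalogram(coef, n_scales=None, scale_min=1, title="CWT Scalogram",
                       clim=(-2.5, 2.5), floor=None, floor_value=-20):
    """Plot log-magnitude CWT coefficients, optionally suppressing background noise."""
    mag = np.log(np.abs(coef))
    if floor is not None:
        mag = mag.copy()
        mag[mag < floor] = floor_value  # threshold suppression of background noise
    if n_scales is not None:
        mag = mag[:n_scales, :]         # zoomed-in view of the high-frequency scales
    fig = plt.figure(figsize=(35, 10))
    plt.imshow(mag, cmap="coolwarm", aspect="auto",
               extent=[0, coef.shape[1] - 1, mag.shape[0], scale_min])
    cbar = plt.colorbar()
    cbar.set_label(label="Amplitude (Log Scale)", size=18)
    cbar.ax.tick_params(labelsize=13)
    plt.title(title, fontsize=20)
    plt.xlabel("Index (20Hz Interval)", fontsize=18)
    plt.ylabel("Adjusted Scale", fontsize=18)
    plt.tick_params(axis="both", which="major", labelsize=13)
    plt.clim(*clim)  # fixed colour range so plots stay comparable
    return fig

# Example: random complex coefficients standing in for pywt.cwt output
rng = np.random.default_rng(0)
demo_coef = rng.normal(size=(40, 200)) + 1j * rng.normal(size=(40, 200))
fig = plot_cwt_scalogram(demo_coef, n_scales=10, floor=-5)
```

Each of the zoomed/thresholded cells above then reduces to a single call with different `n_scales` and `floor` arguments.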
We will only perform CWT on the linear acceleration on the Z axis and the gyroscope roll rotation on the X axis, as those variables capture the motion of a vehicle encountering a pothole.
For the non-key acceleration and gyroscope axes that do not directly capture the action of hitting a pothole, we will convert them into magnitude format, both as a form of dimension reduction and to unpeg the values from any fixed orientation. For instance, linear acceleration on the x and y axes should be direction agnostic, since a vehicle is free to travel in any heading on a horizontal plane. Likewise, for azimuth/yaw gyroscope rotations, the vehicle may turn in any direction on a horizontal plane, which is not indicative of any road anomaly per se. Lastly, a bicycle (or any single-tracked vehicle) is generally in a constant state of wobble (roll rotation) due to the act of balancing.
# Get magnitude of xy-axes linear acceleration
# we will peg these to the filtered results, using the same filters applied to the z-axis (main target)
# F1
df["LA_xy_mag"] = np.sqrt(df['LINEAR ACCELERATION X_f1 (m/s²)']**2 + df['LINEAR ACCELERATION Y_f1 (m/s²)']**2)
# Get magnitude of yz-axes gyroscope rotation
# we will peg these to the filtered results, using the same filters applied to the x-axis (main target)
# F1
# This step has been dropped as it doesn't correlate well with pothole detection as it is too sensitive to noise
#df["GY_yz_mag"] = np.sqrt(df['GYROSCOPE Y_f1 (rad/s)']**2 + df['GYROSCOPE Z_f1 (rad/s)']**2)
# Inspect results
df.head()
Peaks seem to line up pretty well. Potential candidate to support anomaly detection.
# Plot results
t = df['SN']
plt.figure(figsize=(30, 18))
plt.title("Filtered Linear Acceleration Results")
plt.plot(t, df['LINEAR ACCELERATION X_f1 (m/s²)'], label='Filtered X-Axis Signal Zero-Phase Band 1')
plt.plot(t, df['LA_xy_mag']+40, label='Filtered Horizontal Acceleration Magnitude Signal Zero-Phase Band 1') # +40 vertical offset for visual separation
plt.xlabel('Index (20 Hz interval)')
#plt.hlines([-a, a], 0, T, linestyles='--')
plt.grid(True)
plt.axis('tight')
plt.legend(loc='lower right')
plt.show()
# We will be performing CWT on the magnitude values
# the settings used will be pegged to the main target values
#df["LA_xy_mag"] W2
#df["GY_yz_mag"] W1
# Perform CWT on Linear Acceleration
#pywt.cwt(data, scales, wavelet)
# each row of the coefficient output corresponds to one sampled scale
# each column corresponds to one observation in our data
cwt_coef_LA_XY_F1W2, cwt_freqs_LA_XY_F1W2 = pywt.cwt(df["LA_xy_mag"], scale, wavelet2, sampling_period = dt)
# Get the natural log (base e) of the magnitude of the coefficient
cwt_coef_LA_XY_F1W2_log = np.log(abs(cwt_coef_LA_XY_F1W2))
# List of scale names
append_str = "LA_XY_CWT - "
cwt_freqs_LA_XY_F1W2_name = [append_str + str(sub.round(2)) for sub in cwt_freqs_LA_XY_F1W2]
# Check dimension of arrays created
print(cwt_coef_LA_XY_F1W2.shape)
print(cwt_freqs_LA_XY_F1W2.shape)
# Inspect Output Sample
print(cwt_freqs_LA_XY_F1W2.round(2))
print(cwt_coef_LA_XY_F1W2_log[0:5,0:3])
print(cwt_freqs_LA_XY_F1W2_name)
The results seem promising in terms of filtering out normal cycling activity, but less so at separating humps and speed strips from potholes. Potential road anomalies can be seen in the dark orange / red regions. Probably only the first 10 scales are valuable for modelling.
# Plot Graph (2D)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint of potential potholes or bumps
# Identify the observation # affected, and find the GPS coordinates of those regions.
# Since the amplitude spans quite a wide range about the average, we plot the log of the absolute value of the magnitude instead
plt.figure(figsize=(35,10))
plt.imshow(np.log(abs(cwt_coef_LA_XY_F1W2)), cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef_LA_XY_F1W2.shape[1]-1), scale.max(), scale.min()-0.5])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (8th Order Complex Gaussian Wavelet Applied) on Horizontal Linear Acceleration Magnitude (XY-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Plot Graph (2D)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint of potential potholes or bumps
# Identify the observation # affected, and find the GPS coordinates of those regions.
# Since the amplitude spans quite a wide range about the average, we plot the log of the absolute value of the magnitude instead
plt.figure(figsize=(35,10))
plt.imshow(np.log(abs(cwt_coef_LA_XY_F1W2))[0:20,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef_LA_XY_F1W2.shape[1]-1), 20, scale.min()-0.5])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (8th Order Complex Gaussian Wavelet Applied) on Horizontal Linear Acceleration Magnitude (XY-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Threshold suppression of background noise (baseline)
test0 = np.log(abs(cwt_coef_LA_XY_F1W2))
test0[test0 < -2] = -10
# Plot Graph (2D)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint of potential potholes or bumps
# Identify the observation # affected, and find the GPS coordinates of those regions.
# Since the amplitude spans quite a wide range about the average, we plot the log of the absolute value of the magnitude instead
plt.figure(figsize=(35,10))
plt.imshow(test0[0:39,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef_LA_XY_F1W2.shape[1]-1), 39, scale.min()-0.5])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (8th Order Complex Gaussian Wavelet Applied) on Horizontal Linear Acceleration Magnitude (XY-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
# Threshold suppression of background noise (baseline)
test1 = np.log(abs(cwt_coef_LA_XY_F1W2))
test1[test1 < 0.7] = -10
# Plot Graph (2D)
# To interpret this graph, look out for regions with sharp changes in amplitude (dark blue or dark red regions),
# which may hint of potential potholes or bumps
# Identify the observation # affected, and find the GPS coordinates of those regions.
# Since the amplitude spans quite a wide range about the average, we plot the log of the absolute value of the magnitude instead
plt.figure(figsize=(35,10))
plt.imshow(test1[0:39,], cmap = 'coolwarm', aspect = 'auto', extent = [0, (cwt_coef_LA_XY_F1W2.shape[1]-1), 39, scale.min()-0.5])
cbar = plt.colorbar()
cbar.set_label(label='Amplitude (Log Scale)', size=18)
cbar.ax.tick_params(labelsize=13)
plt.title("CWT (8th Order Complex Gaussian Wavelet Applied) on Horizontal Linear Acceleration Magnitude (XY-Axis)", fontsize = 20)
plt.xlabel("Index (20Hz Interval)", fontsize = 18)
plt.ylabel("Adjusted Scale", fontsize = 18)
plt.tick_params(axis='both', which='major', labelsize = 13)
plt.clim(-2.5, 2.5) # Set min-max range of color bar
plt.show()
del test0
del test1
# Perform CWT on candidate signal (Linear Acceleration - Z axis)
#pywt.cwt(data, scales, wavelet)
# each row of the coefficient output corresponds to one sampled scale
# each column corresponds to one observation in our data
cwt_coef_LA_Z_F1W2, cwt_freqs_LA_Z_F1W2 = pywt.cwt(df['LINEAR ACCELERATION Z_f1 (m/s²)'], scale, wavelet2, sampling_period = dt)
# Get the natural log (base e) of the magnitude of the coefficient
cwt_coef_LA_Z_F1W2_log = np.log(abs(cwt_coef_LA_Z_F1W2))
# List of scale names
# Note: unlike the other variables, we round to 3 dp instead of 2 dp here, to avoid variables sharing the same name at the higher scales (wavelet frequencies) used
# This matters especially since we're using all the generated scales in this analysis
append_str = "LA_Z_CWT - "
cwt_freqs_LA_Z_F1W2_name = [append_str + str(sub.round(3)) for sub in cwt_freqs_LA_Z_F1W2]
# Check dimension of arrays created
print(cwt_coef_LA_Z_F1W2.shape)
print(cwt_freqs_LA_Z_F1W2.shape)
# Inspect Output Sample
print(cwt_freqs_LA_Z_F1W2.round(2))
print(cwt_coef_LA_Z_F1W2_log[0:5,0:3])
print(cwt_freqs_LA_Z_F1W2_name)
# Perform CWT on candidate signal (Gyroscope Rotation - X axis)
#pywt.cwt(data, scales, wavelet)
# each row of the coefficient output corresponds to one sampled scale
# each column corresponds to one observation in our data
cwt_coef_GY_X_F1W2, cwt_freqs_GY_X_F1W2 = pywt.cwt(df['GYROSCOPE X_f1 (rad/s)'], scale, wavelet1, sampling_period = dt)
# Get the natural log (base e) of the magnitude of the coefficient
cwt_coef_GY_X_F1W2_log = np.log(abs(cwt_coef_GY_X_F1W2))
# List of scale names
append_str = "GY_X_CWT - "
cwt_freqs_GY_X_F1W2_name = [append_str + str(sub.round(2)) for sub in cwt_freqs_GY_X_F1W2]
# Check dimension of arrays created
print(cwt_coef_GY_X_F1W2.shape)
print(cwt_freqs_GY_X_F1W2.shape)
# Inspect Output Sample
print(cwt_freqs_GY_X_F1W2.round(2))
print(cwt_coef_GY_X_F1W2_log[0:5,0:3])
print(cwt_freqs_GY_X_F1W2_name)
# Threshold suppression of background noise (baseline)
# Impute values that fail the threshold to an arbitrarily low value for better contrast
cwt_coef_LA_XY_cleaned = cwt_coef_LA_XY_F1W2_log.copy()  # copy to avoid mutating the original log arrays in place
cwt_coef_LA_XY_cleaned[cwt_coef_LA_XY_cleaned < -2] = -20
cwt_coef_LA_Z_cleaned = cwt_coef_LA_Z_F1W2_log.copy()
cwt_coef_LA_Z_cleaned[cwt_coef_LA_Z_cleaned < 0.5] = -20
cwt_coef_GY_X_cleaned = cwt_coef_GY_X_F1W2_log.copy()
cwt_coef_GY_X_cleaned[cwt_coef_GY_X_cleaned < -5] = -20
# Convert coefficients from an array format to a dataframe (Linear Acceleration Z-Axis)
df_LA_Z_cwt = pd.DataFrame(cwt_coef_LA_Z_cleaned).transpose()
df_LA_Z_cwt.columns = cwt_freqs_LA_Z_F1W2_name
df_LA_Z_cwt.head(3)
# Convert coefficients from an array format to a dataframe (Gyroscope Rotation X-Axis)
df_GY_X_cwt = pd.DataFrame(cwt_coef_GY_X_cleaned).transpose()
df_GY_X_cwt.columns = cwt_freqs_GY_X_F1W2_name
df_GY_X_cwt.head(3)
# Convert coefficients from an array format to a dataframe (Horizontal Acceleration Magnitude)
df_LA_XY_cwt = pd.DataFrame(cwt_coef_LA_XY_cleaned).transpose()
df_LA_XY_cwt.columns = cwt_freqs_LA_XY_F1W2_name
df_LA_XY_cwt.head(3)
# Append dataframes together
# Only append information-rich zones (scales) that contain the target pattern
# For simplicity of coding later on, we're only using the first 10 scales (high-frequency zones),
# as most of the information is found there and is likely related to the presence of road anomalies
df2 = pd.concat([df, df_LA_Z_cwt.iloc[:, 0:10], df_LA_XY_cwt.iloc[:, 0:10], df_GY_X_cwt.iloc[:, 0:10]], axis=1, join='inner')
df2.shape
# Check dimensions of the CWT dataframes
print(df_LA_XY_cwt.shape)
print(df_LA_Z_cwt.shape)
print(df_GY_X_cwt.shape)
# Inspect dataframe
df2.head(3)
# Inspect dataframe
df2.shape
# Define stats functions for use with groupby().agg()
def q05(x):
    return x.quantile(0.05)

def q25(x):
    return x.quantile(0.25)

def q50(x):
    return x.quantile(0.5)

def q75(x):
    return x.quantile(0.75)

def q95(x):
    return x.quantile(0.95)

def meanCustom(x):
    return x.mean()

def Low_Limit(x):
    # Outlier lower bound: median - 1.5 * IQR
    return x.quantile(0.5) - 1.5 * (x.quantile(0.75) - x.quantile(0.25))

def Upp_Limit(x):
    # Outlier upper bound: median + 1.5 * IQR
    return x.quantile(0.5) + 1.5 * (x.quantile(0.75) - x.quantile(0.25))
Unsurprisingly, pothole events tend to register the highest-magnitude changes in amplitude. Since bumps make up a very small portion of the data, the q95 datapoints are likely the most relevant for deciding a cut-off point for thresholding. Below are the q75 and q95 values of key scales:
# Group the Gyroscope CWT Data by Event across different wavelet scales / frequencies
# Selected stats summary
# Focus more on higher value amplitude changes
# column names must match the wavelet coefficient scales appended to the main data frame
df2.groupby(
['Event']
).agg(
{
# Scales 3 to 8
cwt_freqs_GY_X_F1W2_name[0]: [q75, q95],
cwt_freqs_GY_X_F1W2_name[1]: [q75, q95],
cwt_freqs_GY_X_F1W2_name[2]: [q75, q95],
cwt_freqs_GY_X_F1W2_name[3]: [q75, q95],
cwt_freqs_GY_X_F1W2_name[4]: [q75, q95],
cwt_freqs_GY_X_F1W2_name[5]: [q75, q95],
cwt_freqs_GY_X_F1W2_name[6]: [q75, q95],
cwt_freqs_GY_X_F1W2_name[7]: [q75, q95],
cwt_freqs_GY_X_F1W2_name[8]: [q75, q95],
cwt_freqs_GY_X_F1W2_name[9]: [q75, q95]
}
)
# Group the Gyroscope CWT Data by Event across different wavelet scales / frequencies
# Full stats summary
# column names must match the wavelet coefficient scales appended to the main data frame
df_stats_GY_X_cwt = df2.groupby(
['Event']
).agg(
{
# Scales 3 to 8
cwt_freqs_GY_X_F1W2_name[0]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_GY_X_F1W2_name[1]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_GY_X_F1W2_name[2]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_GY_X_F1W2_name[3]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_GY_X_F1W2_name[4]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_GY_X_F1W2_name[5]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_GY_X_F1W2_name[6]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_GY_X_F1W2_name[7]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_GY_X_F1W2_name[8]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_GY_X_F1W2_name[9]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit]
}
)
df_stats_GY_X_cwt
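The ten hand-written entries in each `.agg()` dictionary above can be generated with a single dict comprehension. A sketch (the column names, the demo dataframe, and the event labels are illustrative placeholders standing in for `cwt_freqs_GY_X_F1W2_name` and `df2`):

```python
import numpy as np
import pandas as pd

# Quantile helpers as defined earlier in the notebook
def q75(x):
    return x.quantile(0.75)

def q95(x):
    return x.quantile(0.95)

# Placeholder scale/frequency column names standing in for cwt_freqs_GY_X_F1W2_name
names = [f"GY_X_CWT - {f:.2f}" for f in np.linspace(10.0, 1.0, 10)]
rng = np.random.default_rng(1)
df_demo = pd.DataFrame(rng.normal(size=(100, 10)), columns=names)
df_demo["Event"] = ["Normal"] * 90 + ["Pothole"] * 10  # illustrative event labels

# One dict comprehension replaces the ten repeated entries
agg_spec = {name: [q75, q95] for name in names}
summary = df_demo.groupby("Event").agg(agg_spec)
```

The same pattern covers the full-summary variant by swapping in the longer list of stats functions.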
Unsurprisingly, pothole events tend to register the highest-magnitude changes in amplitude. Since bumps make up a very small portion of the data, the q75 and q95 datapoints are likely the most relevant for deciding a cut-off point for thresholding. Below are the q75 and q95 values of key scales:
# Group the Linear Acceleration CWT Data by Event across different wavelet scales / frequencies
# Selected stats summary
# Focus more on higher value amplitude changes
# column names must match the wavelet coefficient scales appended to the main data frame
df2.groupby(
['Event']
).agg(
{
# Scales 1 to 5
cwt_freqs_LA_Z_F1W2_name[0]: [q75, q95],
cwt_freqs_LA_Z_F1W2_name[1]: [q75, q95],
cwt_freqs_LA_Z_F1W2_name[2]: [q75, q95],
cwt_freqs_LA_Z_F1W2_name[3]: [q75, q95],
cwt_freqs_LA_Z_F1W2_name[4]: [q75, q95],
# Scales 6 to 10
cwt_freqs_LA_Z_F1W2_name[5]: [q75, q95],
cwt_freqs_LA_Z_F1W2_name[6]: [q75, q95],
cwt_freqs_LA_Z_F1W2_name[7]: [q75, q95],
cwt_freqs_LA_Z_F1W2_name[8]: [q75, q95],
cwt_freqs_LA_Z_F1W2_name[9]: [q75, q95]
}
)
# Group the Linear Acceleration CWT Data by Event across different wavelet scales / frequencies
# Full stats summary
df_stats_LA_Z_cwt = df2.groupby(
['Event']
).agg(
{
# Scales 1 to 5
cwt_freqs_LA_Z_F1W2_name[0]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_Z_F1W2_name[1]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_Z_F1W2_name[2]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_Z_F1W2_name[3]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_Z_F1W2_name[4]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
# Scales 6 to 10
cwt_freqs_LA_Z_F1W2_name[5]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_Z_F1W2_name[6]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_Z_F1W2_name[7]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_Z_F1W2_name[8]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_Z_F1W2_name[9]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit]
}
)
df_stats_LA_Z_cwt
Unsurprisingly, pothole events tend to register the highest-magnitude changes in amplitude. Since bumps make up a very small portion of the data, the q75 and q95 datapoints are likely the most relevant for deciding a cut-off point for thresholding. Below are the q75 and q95 values of key scales:
# Group the Linear Acceleration CWT Data by Event across different wavelet scales / frequencies
# Selected stats summary
# Focus more on higher value amplitude changes
df2.groupby(
['Event']
).agg(
{
# Scales 1 to 5
cwt_freqs_LA_XY_F1W2_name[0]: [q75, q95],
cwt_freqs_LA_XY_F1W2_name[1]: [q75, q95],
cwt_freqs_LA_XY_F1W2_name[2]: [q75, q95],
cwt_freqs_LA_XY_F1W2_name[3]: [q75, q95],
cwt_freqs_LA_XY_F1W2_name[4]: [q75, q95],
# Scales 6 to 10
cwt_freqs_LA_XY_F1W2_name[5]: [q75, q95],
cwt_freqs_LA_XY_F1W2_name[6]: [q75, q95],
cwt_freqs_LA_XY_F1W2_name[7]: [q75, q95],
cwt_freqs_LA_XY_F1W2_name[8]: [q75, q95],
cwt_freqs_LA_XY_F1W2_name[9]: [q75, q95]
}
)
# Group the Linear Acceleration CWT Data by Event across different wavelet scales / frequencies
# Full stats summary
df_stats_LA_XY_cwt = df2.groupby(
['Event']
).agg(
{
# Scales 1 to 5
cwt_freqs_LA_XY_F1W2_name[0]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_XY_F1W2_name[1]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_XY_F1W2_name[2]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_XY_F1W2_name[3]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_XY_F1W2_name[4]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
# Scales 6 to 10
cwt_freqs_LA_XY_F1W2_name[5]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_XY_F1W2_name[6]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_XY_F1W2_name[7]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_XY_F1W2_name[8]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit],
cwt_freqs_LA_XY_F1W2_name[9]: [Low_Limit, min, q05, q25, q50, meanCustom, q75, q95, max, Upp_Limit]
}
)
df_stats_LA_XY_cwt
We will attempt to use a rolling window over the last 30 observations (1.5 s for recordings at 20 Hz) to determine whether there is a road anomaly. I previously created a rolling window function in the custom irrFilter Python module, which computes key stats and appends them as new columns to the dataframe.
We will only apply the rolling window stats to the target filtered signals. The general idea is to get descriptive stats that may be indicative of a pothole event.
Note on threshold tests used:
If a threshold value is negative, the observation failed to reach the threshold. The more positive the value, the more likely the observation is a pothole.
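The `irrFilter.windowSumStats3` helper is a custom module and its internals aren't shown here. Purely as a sketch of what such a rolling-window helper might compute, the function body, output column names, and flag logic below are my assumptions, not the module's actual implementation:

```python
import numpy as np
import pandas as pd

def window_sum_stats(var, master_df, window, prefix, p_upp=0.95, max_limit=None):
    """Append rolling-window summary stats of `var` as new columns (illustrative sketch)."""
    roll = var.rolling(window)
    out = master_df.copy()
    out[f"{prefix}-movMean"] = roll.mean()
    out[f"{prefix}-movSD"] = roll.std()
    out[f"{prefix}-movQUpp"] = roll.quantile(p_upp)
    if max_limit is not None:
        # Flag windows whose peak magnitude exceeds a fixed limit (0/1 integers, PCA-friendly)
        out[f"{prefix}-movExceedLimit"] = (var.abs().rolling(window).max() > max_limit).astype(int)
    return out

# Demo on a synthetic signal standing in for the filtered z-axis acceleration
rng = np.random.default_rng(3)
df_demo = pd.DataFrame({"LA_Z_f1": rng.normal(size=100)})
df_demo = window_sum_stats(df_demo["LA_Z_f1"], df_demo, 30, "LA_Z_f1", max_limit=2.5)
```

The first `window - 1` rows of each new column are NaN, since a full 30-observation window is not yet available there.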
# Set window period
windowPeriod = 30 # number of past observations
# Rolling Window Stats on Linear Acceleration (Z-Axis)
# irrFilter.windowSumStats(varX = df2['LINEAR ACCELERATION Z_f1 (m/s²)'], masterDF = df2, windowVal = windowPeriod, prefix = "LA_Z_f1").tail()
df2 = irrFilter.windowSumStats3(varX = df2['LINEAR ACCELERATION Z_f1 (m/s²)'], masterDF = df2, windowVal = windowPeriod, prefix = "LA_Z_f1", pUpp = 0.95, maxLimit = 7)
# Rolling Window Stats on Horizontal Linear Acceleration Magnitude (XY-Axis)
# irrFilter.windowSumStats(varX = df2['LA_xy_mag'], masterDF = df2, windowVal = windowPeriod, prefix = "LA_XY_f1").tail()
df2 = irrFilter.windowSumStats3(varX = df2['LA_xy_mag'], masterDF = df2, windowVal = windowPeriod, prefix = "LA_XY_f1", pUpp = 0.95, maxLimit = 10)
# Rolling Window Stats on Gyroscope Rotation (Roll; X-Axis)
# irrFilter.windowSumStats(varX = df2['GYROSCOPE X_f1 (rad/s)'], masterDF = df2, windowVal = windowPeriod, prefix = "GY_X_f2").tail()
df2 = irrFilter.windowSumStats3(varX = df2['GYROSCOPE X_f1 (rad/s)'], masterDF = df2, windowVal = windowPeriod, prefix = "GY_X_f1", pUpp = 0.95, maxLimit = 0.2)
# Convert Boolean Values to Integer format
# This would make it possible for PCA later
# Examples
#df2.iloc[:, 109:113]*1 # Select columns by index
#df2.loc[:, "GY_YZ_f2-movExceedSD":"GY_YZ_f2-movExceedUppLim"] * 1 # Select columns by name
#df2.loc[:, "LA_Z_f1-movExceedSD":"LA_Z_f1-movExceedUppLim"] = df2.loc[:, "LA_Z_f1-movExceedSD":"LA_Z_f1-movExceedUppLim"] * 1
#df2.loc[:, "LA_XY_f1-movExceedSD":"LA_XY_f1-movExceedUppLim"] = df2.loc[:, "LA_XY_f1-movExceedSD":"LA_XY_f1-movExceedUppLim"] * 1
#df2.loc[:, "GY_X_f2-movExceedSD":"GY_X_f2-movExceedUppLim"] = df2.loc[:, "GY_X_f2-movExceedSD":"GY_X_f2-movExceedUppLim"] * 1
#df2.loc[:, "GY_YZ_f2-movExceedSD":"GY_YZ_f2-movExceedUppLim"] = df2.loc[:, "GY_YZ_f2-movExceedSD":"GY_YZ_f2-movExceedUppLim"] * 1
# Inspect results
df2.head()
In this case we shall perform thresholding using the CWT coefficients we created earlier, attempting it with both the Upp_Limit (outlier upper bound: median + 1.5 * IQR) and q95 (95th percentile) values. At this stage, I suspect that using the q95 values will give the best separation in results. In order to retain as much information as possible, we record the amount of deviation from each threshold point. Ideally, a pothole signal should have values from 0 upwards, whereas non-pothole signals should have values below 0.
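A tiny worked example of the Upp_Limit bound defined earlier (median + 1.5 * IQR) and the deviation signal it produces; the amplitude values here are made up for illustration:

```python
import pandas as pd

# Toy amplitudes: four baseline readings and one spike
x = pd.Series([0.1, 0.2, 0.3, 0.4, 2.5])
q25, q50, q75 = x.quantile(0.25), x.quantile(0.50), x.quantile(0.75)

upp_limit = q50 + 1.5 * (q75 - q25)  # = 0.3 + 1.5 * (0.4 - 0.2) = 0.6
deviation = x - upp_limit            # positive => exceeds the bound => candidate pothole
```

Only the spike (2.5) lands above the bound, so its deviation is positive while the baseline readings stay negative, matching the intended separation.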
# Multi Index Selection of Threshold values
# Alternative means of representation
pd.options.display.max_columns = None
#print(df_stats_LA_XY_cwt.iloc[:, df_stats_LA_XY_cwt.columns.get_level_values(1)=='q75'].round(3).values) # If you just want the values
print(df_stats_LA_Z_cwt.iloc[:, df_stats_LA_Z_cwt.columns.get_level_values(1)=='Upp_Limit'].round(3))
print(df_stats_LA_Z_cwt.iloc[:, df_stats_LA_Z_cwt.columns.get_level_values(1)=='q95'].round(3))
print(df_stats_LA_XY_cwt.iloc[:, df_stats_LA_XY_cwt.columns.get_level_values(1)=='Upp_Limit'].round(3))
print(df_stats_LA_XY_cwt.iloc[:, df_stats_LA_XY_cwt.columns.get_level_values(1)=='q95'].round(3))
print(df_stats_GY_X_cwt.iloc[:, df_stats_GY_X_cwt.columns.get_level_values(1)=='Upp_Limit'].round(3))
print(df_stats_GY_X_cwt.iloc[:, df_stats_GY_X_cwt.columns.get_level_values(1)=='q95'].round(3))
# Multi Index Selection of Threshold values
phThresholds_Upp_Limit_LA_Z_cwt = df_stats_LA_Z_cwt.iloc[:, df_stats_LA_Z_cwt.columns.get_level_values(1)=='Upp_Limit'].round(3).values[1].tolist()
phThresholds_q95_LA_Z_cwt = df_stats_LA_Z_cwt.iloc[:, df_stats_LA_Z_cwt.columns.get_level_values(1)=='q95'].round(3).values[1].tolist()
phThresholds_Upp_Limit_LA_XY_cwt = df_stats_LA_XY_cwt.iloc[:, df_stats_LA_XY_cwt.columns.get_level_values(1)=='Upp_Limit'].round(3).values[1].tolist()
phThresholds_q95_LA_XY_cwt = df_stats_LA_XY_cwt.iloc[:, df_stats_LA_XY_cwt.columns.get_level_values(1)=='q95'].round(3).values[1].tolist()
phThresholds_Upp_Limit_GY_X_cwt = df_stats_GY_X_cwt.iloc[:, df_stats_GY_X_cwt.columns.get_level_values(1)=='Upp_Limit'].round(3).values[1].tolist()
phThresholds_q95_GY_X_cwt = df_stats_GY_X_cwt.iloc[:, df_stats_GY_X_cwt.columns.get_level_values(1)=='q95'].round(3).values[1].tolist()
print("Upp_Limit and q95 threshold values for LA_Z_cwt")
print(phThresholds_Upp_Limit_LA_Z_cwt)
print(phThresholds_q95_LA_Z_cwt)
print()
print("Upp_Limit and q95 threshold values for LA_XY_cwt")
print(phThresholds_Upp_Limit_LA_XY_cwt)
print(phThresholds_q95_LA_XY_cwt)
print()
print("Upp_Limit and q95 threshold values for GY_X_cwt")
print(phThresholds_Upp_Limit_GY_X_cwt)
print(phThresholds_q95_GY_X_cwt)
# Preview an example of a generated test-column name
(cwt_freqs_GY_X_F1W2_name[2] + "_q95test")
Note:
At the time of writing, I couldn't figure out a more efficient way of looping through this, so the repeated assignments below are the interim fix.
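A pair of nested loops can generate the repeated threshold-test columns programmatically. A sketch: the dataframe, column names, and threshold lists below are illustrative placeholders standing in for `df2`, the `cwt_freqs_*_name` lists, and the `phThresholds_*` lists:

```python
import numpy as np
import pandas as pd

# Placeholder names and thresholds standing in for cwt_freqs_LA_Z_F1W2_name,
# phThresholds_Upp_Limit_LA_Z_cwt, and phThresholds_q95_LA_Z_cwt
names = [f"LA_Z_CWT - {i}" for i in range(10)]
rng = np.random.default_rng(2)
df_demo = pd.DataFrame(rng.normal(size=(50, 10)), columns=names)
thresholds = {
    "Upp_Limit": rng.normal(size=10).round(3).tolist(),
    "q95": rng.normal(size=10).round(3).tolist(),
}

# Two nested loops generate all twenty deviation columns
for test_name, values in thresholds.items():
    for col, threshold in zip(names, values):
        # Positive result = amplitude exceeds the event threshold at that scale
        df_demo[f"{col}_{test_name}test"] = df_demo[col] - threshold
```

The same loop structure extends to the LA_XY and GY_X coefficient sets by swapping in their name and threshold lists.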
# Threshold tests for LA_Z_cwt
# Scales 1 to 5
df2[(cwt_freqs_LA_Z_F1W2_name[0] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_Z_F1W2_name[0]] - phThresholds_Upp_Limit_LA_Z_cwt[0])
df2[(cwt_freqs_LA_Z_F1W2_name[0] + "_q95test")] = (df2[cwt_freqs_LA_Z_F1W2_name[0]] - phThresholds_q95_LA_Z_cwt[0])
df2[(cwt_freqs_LA_Z_F1W2_name[1] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_Z_F1W2_name[1]] - phThresholds_Upp_Limit_LA_Z_cwt[1])
df2[(cwt_freqs_LA_Z_F1W2_name[1] + "_q95test")] = (df2[cwt_freqs_LA_Z_F1W2_name[1]] - phThresholds_q95_LA_Z_cwt[1])
df2[(cwt_freqs_LA_Z_F1W2_name[2] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_Z_F1W2_name[2]] - phThresholds_Upp_Limit_LA_Z_cwt[2])
df2[(cwt_freqs_LA_Z_F1W2_name[2] + "_q95test")] = (df2[cwt_freqs_LA_Z_F1W2_name[2]] - phThresholds_q95_LA_Z_cwt[2])
df2[(cwt_freqs_LA_Z_F1W2_name[3] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_Z_F1W2_name[3]] - phThresholds_Upp_Limit_LA_Z_cwt[3])
df2[(cwt_freqs_LA_Z_F1W2_name[3] + "_q95test")] = (df2[cwt_freqs_LA_Z_F1W2_name[3]] - phThresholds_q95_LA_Z_cwt[3])
df2[(cwt_freqs_LA_Z_F1W2_name[4] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_Z_F1W2_name[4]] - phThresholds_Upp_Limit_LA_Z_cwt[4])
df2[(cwt_freqs_LA_Z_F1W2_name[4] + "_q95test")] = (df2[cwt_freqs_LA_Z_F1W2_name[4]] - phThresholds_q95_LA_Z_cwt[4])
# Scales 6 to 10
df2[(cwt_freqs_LA_Z_F1W2_name[5] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_Z_F1W2_name[5]] - phThresholds_Upp_Limit_LA_Z_cwt[5])
df2[(cwt_freqs_LA_Z_F1W2_name[5] + "_q95test")] = (df2[cwt_freqs_LA_Z_F1W2_name[5]] - phThresholds_q95_LA_Z_cwt[5])
df2[(cwt_freqs_LA_Z_F1W2_name[6] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_Z_F1W2_name[6]] - phThresholds_Upp_Limit_LA_Z_cwt[6])
df2[(cwt_freqs_LA_Z_F1W2_name[6] + "_q95test")] = (df2[cwt_freqs_LA_Z_F1W2_name[6]] - phThresholds_q95_LA_Z_cwt[6])
df2[(cwt_freqs_LA_Z_F1W2_name[7] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_Z_F1W2_name[7]] - phThresholds_Upp_Limit_LA_Z_cwt[7])
df2[(cwt_freqs_LA_Z_F1W2_name[7] + "_q95test")] = (df2[cwt_freqs_LA_Z_F1W2_name[7]] - phThresholds_q95_LA_Z_cwt[7])
df2[(cwt_freqs_LA_Z_F1W2_name[8] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_Z_F1W2_name[8]] - phThresholds_Upp_Limit_LA_Z_cwt[8])
df2[(cwt_freqs_LA_Z_F1W2_name[8] + "_q95test")] = (df2[cwt_freqs_LA_Z_F1W2_name[8]] - phThresholds_q95_LA_Z_cwt[8])
df2[(cwt_freqs_LA_Z_F1W2_name[9] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_Z_F1W2_name[9]] - phThresholds_Upp_Limit_LA_Z_cwt[9])
df2[(cwt_freqs_LA_Z_F1W2_name[9] + "_q95test")] = (df2[cwt_freqs_LA_Z_F1W2_name[9]] - phThresholds_q95_LA_Z_cwt[9])
# Threshold tests for LA_XY_cwt
# Scales 1 to 5
df2[(cwt_freqs_LA_XY_F1W2_name[0] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_XY_F1W2_name[0]] - phThresholds_Upp_Limit_LA_XY_cwt[0])
df2[(cwt_freqs_LA_XY_F1W2_name[0] + "_q95test")] = (df2[cwt_freqs_LA_XY_F1W2_name[0]] - phThresholds_q95_LA_XY_cwt[0])
df2[(cwt_freqs_LA_XY_F1W2_name[1] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_XY_F1W2_name[1]] - phThresholds_Upp_Limit_LA_XY_cwt[1])
df2[(cwt_freqs_LA_XY_F1W2_name[1] + "_q95test")] = (df2[cwt_freqs_LA_XY_F1W2_name[1]] - phThresholds_q95_LA_XY_cwt[1])
df2[(cwt_freqs_LA_XY_F1W2_name[2] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_XY_F1W2_name[2]] - phThresholds_Upp_Limit_LA_XY_cwt[2])
df2[(cwt_freqs_LA_XY_F1W2_name[2] + "_q95test")] = (df2[cwt_freqs_LA_XY_F1W2_name[2]] - phThresholds_q95_LA_XY_cwt[2])
df2[(cwt_freqs_LA_XY_F1W2_name[3] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_XY_F1W2_name[3]] - phThresholds_Upp_Limit_LA_XY_cwt[3])
df2[(cwt_freqs_LA_XY_F1W2_name[3] + "_q95test")] = (df2[cwt_freqs_LA_XY_F1W2_name[3]] - phThresholds_q95_LA_XY_cwt[3])
df2[(cwt_freqs_LA_XY_F1W2_name[4] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_XY_F1W2_name[4]] - phThresholds_Upp_Limit_LA_XY_cwt[4])
df2[(cwt_freqs_LA_XY_F1W2_name[4] + "_q95test")] = (df2[cwt_freqs_LA_XY_F1W2_name[4]] - phThresholds_q95_LA_XY_cwt[4])
# Scales 6 to 10
df2[(cwt_freqs_LA_XY_F1W2_name[5] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_XY_F1W2_name[5]] - phThresholds_Upp_Limit_LA_XY_cwt[5])
df2[(cwt_freqs_LA_XY_F1W2_name[5] + "_q95test")] = (df2[cwt_freqs_LA_XY_F1W2_name[5]] - phThresholds_q95_LA_XY_cwt[5])
df2[(cwt_freqs_LA_XY_F1W2_name[6] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_XY_F1W2_name[6]] - phThresholds_Upp_Limit_LA_XY_cwt[6])
df2[(cwt_freqs_LA_XY_F1W2_name[6] + "_q95test")] = (df2[cwt_freqs_LA_XY_F1W2_name[6]] - phThresholds_q95_LA_XY_cwt[6])
df2[(cwt_freqs_LA_XY_F1W2_name[7] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_XY_F1W2_name[7]] - phThresholds_Upp_Limit_LA_XY_cwt[7])
df2[(cwt_freqs_LA_XY_F1W2_name[7] + "_q95test")] = (df2[cwt_freqs_LA_XY_F1W2_name[7]] - phThresholds_q95_LA_XY_cwt[7])
df2[(cwt_freqs_LA_XY_F1W2_name[8] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_XY_F1W2_name[8]] - phThresholds_Upp_Limit_LA_XY_cwt[8])
df2[(cwt_freqs_LA_XY_F1W2_name[8] + "_q95test")] = (df2[cwt_freqs_LA_XY_F1W2_name[8]] - phThresholds_q95_LA_XY_cwt[8])
df2[(cwt_freqs_LA_XY_F1W2_name[9] + "_Upp_Limittest")] = (df2[cwt_freqs_LA_XY_F1W2_name[9]] - phThresholds_Upp_Limit_LA_XY_cwt[9])
df2[(cwt_freqs_LA_XY_F1W2_name[9] + "_q95test")] = (df2[cwt_freqs_LA_XY_F1W2_name[9]] - phThresholds_q95_LA_XY_cwt[9])
# Threshold tests for GY_X_cwt
# Scales 1 to 5
df2[(cwt_freqs_GY_X_F1W2_name[0] + "_Upp_Limittest")] = (df2[cwt_freqs_GY_X_F1W2_name[0]] - phThresholds_Upp_Limit_GY_X_cwt[0])
df2[(cwt_freqs_GY_X_F1W2_name[0] + "_q95test")] = (df2[cwt_freqs_GY_X_F1W2_name[0]] - phThresholds_q95_GY_X_cwt[0])
df2[(cwt_freqs_GY_X_F1W2_name[1] + "_Upp_Limittest")] = (df2[cwt_freqs_GY_X_F1W2_name[1]] - phThresholds_Upp_Limit_GY_X_cwt[1])
df2[(cwt_freqs_GY_X_F1W2_name[1] + "_q95test")] = (df2[cwt_freqs_GY_X_F1W2_name[1]] - phThresholds_q95_GY_X_cwt[1])
df2[(cwt_freqs_GY_X_F1W2_name[2] + "_Upp_Limittest")] = (df2[cwt_freqs_GY_X_F1W2_name[2]] - phThresholds_Upp_Limit_GY_X_cwt[2])
df2[(cwt_freqs_GY_X_F1W2_name[2] + "_q95test")] = (df2[cwt_freqs_GY_X_F1W2_name[2]] - phThresholds_q95_GY_X_cwt[2])
df2[(cwt_freqs_GY_X_F1W2_name[3] + "_Upp_Limittest")] = (df2[cwt_freqs_GY_X_F1W2_name[3]] - phThresholds_Upp_Limit_GY_X_cwt[3])
df2[(cwt_freqs_GY_X_F1W2_name[3] + "_q95test")] = (df2[cwt_freqs_GY_X_F1W2_name[3]] - phThresholds_q95_GY_X_cwt[3])
df2[(cwt_freqs_GY_X_F1W2_name[4] + "_Upp_Limittest")] = (df2[cwt_freqs_GY_X_F1W2_name[4]] - phThresholds_Upp_Limit_GY_X_cwt[4])
df2[(cwt_freqs_GY_X_F1W2_name[4] + "_q95test")] = (df2[cwt_freqs_GY_X_F1W2_name[4]] - phThresholds_q95_GY_X_cwt[4])
# Scales 6 to 10
df2[(cwt_freqs_GY_X_F1W2_name[5] + "_Upp_Limittest")] = (df2[cwt_freqs_GY_X_F1W2_name[5]] - phThresholds_Upp_Limit_GY_X_cwt[5])
df2[(cwt_freqs_GY_X_F1W2_name[5] + "_q95test")] = (df2[cwt_freqs_GY_X_F1W2_name[5]] - phThresholds_q95_GY_X_cwt[5])
df2[(cwt_freqs_GY_X_F1W2_name[6] + "_Upp_Limittest")] = (df2[cwt_freqs_GY_X_F1W2_name[6]] - phThresholds_Upp_Limit_GY_X_cwt[6])
df2[(cwt_freqs_GY_X_F1W2_name[6] + "_q95test")] = (df2[cwt_freqs_GY_X_F1W2_name[6]] - phThresholds_q95_GY_X_cwt[6])
df2[(cwt_freqs_GY_X_F1W2_name[7] + "_Upp_Limittest")] = (df2[cwt_freqs_GY_X_F1W2_name[7]] - phThresholds_Upp_Limit_GY_X_cwt[7])
df2[(cwt_freqs_GY_X_F1W2_name[7] + "_q95test")] = (df2[cwt_freqs_GY_X_F1W2_name[7]] - phThresholds_q95_GY_X_cwt[7])
df2[(cwt_freqs_GY_X_F1W2_name[8] + "_Upp_Limittest")] = (df2[cwt_freqs_GY_X_F1W2_name[8]] - phThresholds_Upp_Limit_GY_X_cwt[8])
df2[(cwt_freqs_GY_X_F1W2_name[8] + "_q95test")] = (df2[cwt_freqs_GY_X_F1W2_name[8]] - phThresholds_q95_GY_X_cwt[8])
df2[(cwt_freqs_GY_X_F1W2_name[9] + "_Upp_Limittest")] = (df2[cwt_freqs_GY_X_F1W2_name[9]] - phThresholds_Upp_Limit_GY_X_cwt[9])
df2[(cwt_freqs_GY_X_F1W2_name[9] + "_q95test")] = (df2[cwt_freqs_GY_X_F1W2_name[9]] - phThresholds_q95_GY_X_cwt[9])
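As noted earlier, the repeated per-scale assignments are an interim fix. They can be collapsed into a nested loop; below is a minimal sketch, assuming each name list and its two threshold arrays are index-aligned (the dictionary layout and demo data here are illustrative stand-ins, not the notebook's actual variables):

```python
import pandas as pd

# Toy stand-ins for the notebook's df2, name lists, and threshold arrays
df2 = pd.DataFrame({"LA_Z_cwt_s1": [0.0, 3.0], "LA_Z_cwt_s2": [1.0, 4.0]})
threshold_sets = {
    # sensor: (column names, Upp_Limit thresholds, q95 thresholds), index-aligned
    "LA_Z": (["LA_Z_cwt_s1", "LA_Z_cwt_s2"], [1.0, 2.0], [0.5, 1.5]),
}

for sensor, (names, upp_limits, q95s) in threshold_sets.items():
    for col, upp, q95 in zip(names, upp_limits, q95s):
        # Positive values indicate the signal exceeds the threshold
        df2[col + "_Upp_Limittest"] = df2[col] - upp
        df2[col + "_q95test"] = df2[col] - q95
```

In the notebook, `threshold_sets` would map each of LA_Z, LA_XY, and GY_X to its `cwt_freqs_*_F1W2_name` list and the corresponding `phThresholds_Upp_Limit_*` and `phThresholds_q95_*` arrays.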
# Inspect results
df2.head()
Optimisation of the dataset for modelling purposes: faster runtime and lower memory requirements
# Junk Redundant Variables
del df
del df_stats_GY_X_cwt
del df_stats_LA_Z_cwt
del df_stats_LA_XY_cwt
del df_LA_Z_cwt
del df_LA_XY_cwt
del cwt_coef_LA_XY_cleaned
del cwt_coef_LA_Z_cleaned
del cwt_coef_GY_X_cleaned
#del df_GY_X_cwt
# List all columns
list(df2)
# Junk Redundant Columns (variables)
df2 = df2.drop(columns=['LINEAR ACCELERATION X (m/s²)',
'LINEAR ACCELERATION Y (m/s²)',
'LINEAR ACCELERATION Z (m/s²)',
'GYROSCOPE X (rad/s)',
'GYROSCOPE Y (rad/s)',
'GYROSCOPE Z (rad/s)',
'GYROSCOPE Y_f1 (rad/s)',
'GYROSCOPE Z_f1 (rad/s)'
])
# Inspect results
df2.head()
# Inspect results
df2.shape
# Impute NaN to 0
# PCA cannot accept missing values
df2 = df2.fillna(0)
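Since memory is a constraint here, one optional saving (not part of the original notebook) is to downcast the float64 feature columns to float32 before scaling; this is a sketch under the assumption that the small precision loss is acceptable for these sensor-derived features:

```python
import numpy as np
import pandas as pd

def downcast_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Cast float64 columns to float32, roughly halving their memory use."""
    out = df.copy()
    float_cols = out.select_dtypes(include="float64").columns
    out[float_cols] = out[float_cols].astype(np.float32)
    return out

# Demo on toy data; in the notebook this would be applied to df2
demo = pd.DataFrame({"a": np.zeros(1000), "b": np.ones(1000)})
before = demo.memory_usage(deep=True).sum()
after = downcast_floats(demo).memory_usage(deep=True).sum()
```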
# Inspect and test selection
df2.iloc[: , 14::]
# Inspect and test selection
df2.iloc[: , 14::].shape
# Perform data scaling
scaled_data = preprocessing.StandardScaler().fit_transform(df2.iloc[: , 14::].values)
scaled_data
First run of PCA
# Perform PCA by selecting the number of components desired
pca = PCA(n_components = 44)
# Apply PCA on selected variables only
principalComponents = pca.fit_transform(scaled_data)
# Print total explained variance of the PCA components
print("Total explained variance of the PCA components (%)")
print(sum(pca.explained_variance_ratio_*100).round(3))
# Print explained variance of the PCA components
print("Explained variance of each PCA component (%)")
print((pca.explained_variance_ratio_*100).round(3))
# Print sum of eigenvalues of the PCA components (when all components are kept, the sum of the eigenvalues should equal the number of variables)
print("Sum of Eigenvalues")
print(sum(pca.explained_variance_).round(3))
# Print eigenvalues of PCA components
print("Eigenvalues of each PCA component")
print(pca.explained_variance_.round(3))
# Get default PCA cutoff: retain components with eigenvalue >= 1 (Kaiser criterion)
pca_pcNum = [i for i in range(len(pca.explained_variance_)) if pca.explained_variance_[i] >= 1][-1]+1
pca_pcNum
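The cutoff above implements the Kaiser criterion (retain components with eigenvalue >= 1). Note that the list-indexing version raises an IndexError if no eigenvalue reaches 1, whereas a simple count does not. A minimal sketch of the count form, plus a cumulative-variance (scree-style) plot, on toy data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # toy standardised data

pca_demo = PCA().fit(X)
# Kaiser criterion as a count: eigenvalues are sorted descending,
# so counting those >= 1 matches the last-index-plus-one approach
kaiser_n = int(np.sum(pca_demo.explained_variance_ >= 1))

cum_var = np.cumsum(pca_demo.explained_variance_ratio_) * 100
fig, ax = plt.subplots()
ax.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
ax.axvline(kaiser_n, linestyle="--")
ax.set_xlabel("Number of components")
ax.set_ylabel("Cumulative explained variance (%)")
```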
Rules of thumb when selecting the number of PCA components:
# Perform PCA by selecting the number of components desired
pca = PCA(n_components = pca_pcNum) # uses default PCA cutoff criteria to select number of PC
# Apply PCA on selected variables only
principalComponents = pca.fit_transform(scaled_data)
# Print total explained variance of the PCA components
print("Total explained variance of the PCA components (%)")
print(sum(pca.explained_variance_ratio_*100).round(3))
# Print explained variance of the PCA components
print("Explained variance of each PCA component (%)")
print((pca.explained_variance_ratio_*100).round(3))
# Print sum of eigenvalues of the PCA components (when all components are kept, the sum of the eigenvalues should equal the number of variables)
print("Sum of Eigenvalues")
print(sum(pca.explained_variance_).round(3))
# Print eigenvalues of PCA components
print("Eigenvalues of each PCA component")
print(pca.explained_variance_.round(3))
# Generate list of pc numbers
# this is based on the default PCA number settings
pcList = np.arange(1, pca_pcNum+1, 1).tolist()
pcList
# Generate list of PC Names
pcNameList = ["PC_" + str(s) for s in pcList]
pcNameList
# Convert PCA components to dataframe
principalDf = pd.DataFrame(data = principalComponents,
columns = pcNameList)
# Inspect Results
principalDf.head()
# Inspect Results
principalDf.shape
# Append Principal Components to main data
df3 = pd.concat([df2.iloc[: , 0:14], principalDf], axis = 1)
# Inspect Results
df3.head()
Finding relationships in the PCA plots. Unsurprisingly, the PC 01 and PC 02 combination provides the best separation in event detection, since these components carry the most information. There also seems to be a slight quadratic relationship with PC 02. Generally, potholes (Event 1) tend to have more extreme values than other events, and pothole events often appear as extreme outliers.
# PCA plot by Event between PC 01 and PC 02
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (20,10))
sns.scatterplot(x = pcNameList[0], y = pcNameList[1], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 02')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# PCA plot by Event between PC 01 and PC 03
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (20,10))
sns.scatterplot(x = pcNameList[0], y = pcNameList[2], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 03')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
Now we will attempt different methods to tease out suspected pothole events. Due to the memory limitation on Google Colab (25 GB), it is best to run each section block by block and delete any unused variables; otherwise the script will crash and you will have to start over. Hence the multiple breakpoints.
Thresholding of Principal Components to Separate Pothole Events
# Define global threshold points
pcCutoff = 0.002
pcCutoff_Normal = 1 - pcCutoff
pcCutoff_TopTail = 1 - (pcCutoff / 2)
pcCutoff_LowTail = (pcCutoff / 2)
# PCA plot by Event between PC 01 and PC 02
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[1], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 02')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC01 = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[0]], pcCutoff_Normal) # We will define this once only
cutoff_PC02 = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[1]], pcCutoff_Normal)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC02 + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str(round(cutoff_PC02,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC02,0) # Draw horizontal cutoff point
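The blocks that follow repeat the same pattern per component: compute two-tailed quantile cutoffs from non-pothole rows, scatter PC 01 against PC k, and draw the cutoff lines. This could be collapsed into a reusable helper; a sketch under the assumption that the function name and toy data below are mine, while `df3`, `pcNameList`, `cutoff_PC01`, `pcCutoff_TopTail` and `pcCutoff_LowTail` come from the notebook:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

def plot_pc_with_cutoffs(df, pc_names, y_idx, x_cutoff, top_q, low_q):
    """Scatter PC 01 vs PC[y_idx], drawing two-tailed quantile cutoffs
    computed from non-pothole rows (Event != 1) only."""
    non_pothole = df.loc[df["Event"] != 1, pc_names[y_idx]]
    cut_top = np.quantile(non_pothole, top_q)
    cut_low = np.quantile(non_pothole, low_q)
    plt.figure(figsize=(13, 13))
    sns.scatterplot(x=pc_names[0], y=pc_names[y_idx], data=df,
                    hue="Event", alpha=0.3, palette="muted")
    plt.legend(loc="lower right")
    plt.axvline(x_cutoff)   # vertical cutoff on PC 01
    plt.axhline(cut_top)    # upper-tail cutoff on PC k
    plt.axhline(cut_low)    # lower-tail cutoff on PC k
    return cut_top, cut_low

# Toy demo; in the notebook this would loop over PC 03..PC 22:
#   for i in range(2, 22):
#       plot_pc_with_cutoffs(df3, pcNameList, i, cutoff_PC01,
#                            pcCutoff_TopTail, pcCutoff_LowTail)
rng = np.random.default_rng(1)
toy = pd.DataFrame({"PC_1": rng.normal(size=60),
                    "PC_2": rng.normal(size=60),
                    "Event": [1] * 6 + [0] * 54})
cut_top, cut_low = plot_pc_with_cutoffs(toy, ["PC_1", "PC_2"], 1, 0.0, 0.999, 0.001)
```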
# PCA plot by Event between PC 01 and PC 03
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[2], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 03')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC03a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[2]], pcCutoff_TopTail)
cutoff_PC03b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[2]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC03a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC03a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC03b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC03b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC03a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC03b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 04
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[3], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 04')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC04a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[3]], pcCutoff_TopTail)
cutoff_PC04b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[3]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC04a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC04a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC04b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC04b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC04a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC04b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 05
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[4], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 05')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC05a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[4]], pcCutoff_TopTail)
cutoff_PC05b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[4]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC05a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC05a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC05b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC05b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC05a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC05b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 06
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[5], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 06')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC06a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[5]], pcCutoff_TopTail)
cutoff_PC06b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[5]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC06a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC06a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC06b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC06b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC06a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC06b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 07
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[6], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 07')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC07a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[6]], pcCutoff_TopTail)
cutoff_PC07b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[6]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC07a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC07a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC07b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC07b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC07a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC07b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 08
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[7], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 08')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC08a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[7]], pcCutoff_TopTail)
cutoff_PC08b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[7]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC08a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC08a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC08b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC08b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC08a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC08b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 09
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[8], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 09')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC09a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[8]], pcCutoff_TopTail)
cutoff_PC09b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[8]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC09a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC09a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC09b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC09b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC09a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC09b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 10
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[9], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 10')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC10a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[9]], pcCutoff_TopTail)
cutoff_PC10b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[9]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC10a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str(round(cutoff_PC10a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC10b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str(round(cutoff_PC10b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC10a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC10b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 11
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[10], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 11')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC11a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[10]], pcCutoff_TopTail)
cutoff_PC11b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[10]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC11a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC11a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC11b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC11b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC11a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC11b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 12
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[11], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 12')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC12a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[11]], pcCutoff_TopTail)
cutoff_PC12b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[11]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC12a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC12a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC12b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC12b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC12a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC12b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 13
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[12], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 13')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC13a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[12]], pcCutoff_TopTail)
cutoff_PC13b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[12]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC13a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC13a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC13b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC13b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC13a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC13b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 14
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[13], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 14')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC14a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[13]], pcCutoff_TopTail)
cutoff_PC14b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[13]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC14a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC14a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC14b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC14b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC14a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC14b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 15
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[14], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 15')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC15a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[14]], pcCutoff_TopTail)
cutoff_PC15b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[14]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC15a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC15a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC15b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC15b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC15a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC15b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 16
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[15], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 16')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC16a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[15]], pcCutoff_TopTail)
cutoff_PC16b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[15]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC16a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC16a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC16b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC16b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC16a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC16b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 17
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[16], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 17')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC17a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[16]], pcCutoff_TopTail)
cutoff_PC17b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[16]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC17a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC17a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC17b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC17b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC17a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC17b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 18
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[17], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 18')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC18a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[17]], pcCutoff_TopTail)
cutoff_PC18b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[17]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC18a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC18a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC18b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC18b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC18a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC18b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 19
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[18], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 19')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC19a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[18]], pcCutoff_TopTail)
cutoff_PC19b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[18]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC19a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC19a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC19b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC19b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC19a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC19b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 20
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[19], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 20')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC20a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[19]], pcCutoff_TopTail)
cutoff_PC20b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[19]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC20a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC20a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC20b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC20b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC20a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC20b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 21
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[20], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 21')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC21a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[20]], pcCutoff_TopTail)
cutoff_PC21b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[20]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC21a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC21a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC21b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC21b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC21a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC21b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 22
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[21], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 22')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC22a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[21]], pcCutoff_TopTail)
cutoff_PC22b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[21]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC22a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC22a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC22b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC22b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC22a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC22b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 23
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[22], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 23')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC23a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[22]], pcCutoff_TopTail)
cutoff_PC23b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[22]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC23a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC23a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC23b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC23b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC23a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC23b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 24
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[23], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 24')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC24a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[23]], pcCutoff_TopTail)
cutoff_PC24b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[23]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC24a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC24a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC24b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC24b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC24a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC24b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 25
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[24], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 25')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC25a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[24]], pcCutoff_TopTail)
cutoff_PC25b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[24]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC25a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC25a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC25b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC25b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC25a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC25b,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 26
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[25], data = df3, hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 26')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define cutoff points
cutoff_PC26a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[25]], pcCutoff_TopTail)
cutoff_PC26b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[25]], pcCutoff_LowTail)
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 1, cutoff_PC26a + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC26a,1)) + ")"))
plt.text(cutoff_PC01 + 1, cutoff_PC26b + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC26b,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC26a,0) # Draw horizontal cutoff point
plt.axhline(cutoff_PC26b,0) # Draw horizontal cutoff point
A weighted threshold is applied to determine whether an observation is a pothole. Each PC's vote is weighted by its explained variance ratio from the PCA step, so the resulting score lies between 0 and 1: the closer the score is to 1, the more likely the observation is a pothole.
# List of PC thresholds
pcThresholdList = [cutoff_PC01, cutoff_PC02,
cutoff_PC03a, cutoff_PC03b,
cutoff_PC04a, cutoff_PC04b,
cutoff_PC05a, cutoff_PC05b,
cutoff_PC06a, cutoff_PC06b,
cutoff_PC07a, cutoff_PC07b,
cutoff_PC08a, cutoff_PC08b,
cutoff_PC09a, cutoff_PC09b,
cutoff_PC10a, cutoff_PC10b,
cutoff_PC11a, cutoff_PC11b,
cutoff_PC12a, cutoff_PC12b,
cutoff_PC13a, cutoff_PC13b,
cutoff_PC14a, cutoff_PC14b,
cutoff_PC15a, cutoff_PC15b,
cutoff_PC16a, cutoff_PC16b,
cutoff_PC17a, cutoff_PC17b,
cutoff_PC18a, cutoff_PC18b,
cutoff_PC19a, cutoff_PC19b,
cutoff_PC20a, cutoff_PC20b,
cutoff_PC21a, cutoff_PC21b,
cutoff_PC22a, cutoff_PC22b,
cutoff_PC23a, cutoff_PC23b,
cutoff_PC24a, cutoff_PC24b,
cutoff_PC25a, cutoff_PC25b,
cutoff_PC26a, cutoff_PC26b
]
# inspect list
pcThresholdList[0:5]
# inspect list
len(pcThresholdList)
# Define thresholding code
df3["probThres_pc"] = (
((df3[pcNameList[0]] >= pcThresholdList[0]) * 1) * pca.explained_variance_ratio_[0] +
((df3[pcNameList[1]] >= pcThresholdList[1]) * 1) * pca.explained_variance_ratio_[1] +
((df3[pcNameList[2]] >= pcThresholdList[2]) * 1) * pca.explained_variance_ratio_[2] +
((df3[pcNameList[2]] <= pcThresholdList[3]) * 1) * pca.explained_variance_ratio_[2] +
((df3[pcNameList[3]] >= pcThresholdList[4]) * 1) * pca.explained_variance_ratio_[3] +
((df3[pcNameList[3]] <= pcThresholdList[5]) * 1) * pca.explained_variance_ratio_[3] +
((df3[pcNameList[4]] >= pcThresholdList[6]) * 1) * pca.explained_variance_ratio_[4] +
((df3[pcNameList[4]] <= pcThresholdList[7]) * 1) * pca.explained_variance_ratio_[4] +
((df3[pcNameList[5]] >= pcThresholdList[8]) * 1) * pca.explained_variance_ratio_[5] +
((df3[pcNameList[5]] <= pcThresholdList[9]) * 1) * pca.explained_variance_ratio_[5] +
((df3[pcNameList[6]] >= pcThresholdList[10]) * 1) * pca.explained_variance_ratio_[6] +
((df3[pcNameList[6]] <= pcThresholdList[11]) * 1) * pca.explained_variance_ratio_[6] +
((df3[pcNameList[7]] >= pcThresholdList[12]) * 1) * pca.explained_variance_ratio_[7] +
((df3[pcNameList[7]] <= pcThresholdList[13]) * 1) * pca.explained_variance_ratio_[7] +
((df3[pcNameList[8]] >= pcThresholdList[14]) * 1) * pca.explained_variance_ratio_[8] +
((df3[pcNameList[8]] <= pcThresholdList[15]) * 1) * pca.explained_variance_ratio_[8] +
((df3[pcNameList[9]] >= pcThresholdList[16]) * 1) * pca.explained_variance_ratio_[9] +
((df3[pcNameList[9]] <= pcThresholdList[17]) * 1) * pca.explained_variance_ratio_[9] +
((df3[pcNameList[10]] >= pcThresholdList[18]) * 1) * pca.explained_variance_ratio_[10] +
((df3[pcNameList[10]] <= pcThresholdList[19]) * 1) * pca.explained_variance_ratio_[10] +
((df3[pcNameList[11]] >= pcThresholdList[20]) * 1) * pca.explained_variance_ratio_[11] +
((df3[pcNameList[11]] <= pcThresholdList[21]) * 1) * pca.explained_variance_ratio_[11] +
((df3[pcNameList[12]] >= pcThresholdList[22]) * 1) * pca.explained_variance_ratio_[12] +
((df3[pcNameList[12]] <= pcThresholdList[23]) * 1) * pca.explained_variance_ratio_[12] +
((df3[pcNameList[13]] >= pcThresholdList[24]) * 1) * pca.explained_variance_ratio_[13] +
((df3[pcNameList[13]] <= pcThresholdList[25]) * 1) * pca.explained_variance_ratio_[13] +
((df3[pcNameList[14]] >= pcThresholdList[26]) * 1) * pca.explained_variance_ratio_[14] +
((df3[pcNameList[14]] <= pcThresholdList[27]) * 1) * pca.explained_variance_ratio_[14] +
((df3[pcNameList[15]] >= pcThresholdList[28]) * 1) * pca.explained_variance_ratio_[15] +
((df3[pcNameList[15]] <= pcThresholdList[29]) * 1) * pca.explained_variance_ratio_[15] +
((df3[pcNameList[16]] >= pcThresholdList[30]) * 1) * pca.explained_variance_ratio_[16] +
((df3[pcNameList[16]] <= pcThresholdList[31]) * 1) * pca.explained_variance_ratio_[16] +
((df3[pcNameList[17]] >= pcThresholdList[32]) * 1) * pca.explained_variance_ratio_[17] +
((df3[pcNameList[17]] <= pcThresholdList[33]) * 1) * pca.explained_variance_ratio_[17] +
((df3[pcNameList[18]] >= pcThresholdList[34]) * 1) * pca.explained_variance_ratio_[18] +
((df3[pcNameList[18]] <= pcThresholdList[35]) * 1) * pca.explained_variance_ratio_[18] +
((df3[pcNameList[19]] >= pcThresholdList[36]) * 1) * pca.explained_variance_ratio_[19] +
((df3[pcNameList[19]] <= pcThresholdList[37]) * 1) * pca.explained_variance_ratio_[19] +
((df3[pcNameList[20]] >= pcThresholdList[38]) * 1) * pca.explained_variance_ratio_[20] +
((df3[pcNameList[20]] <= pcThresholdList[39]) * 1) * pca.explained_variance_ratio_[20] +
((df3[pcNameList[21]] >= pcThresholdList[40]) * 1) * pca.explained_variance_ratio_[21] +
((df3[pcNameList[21]] <= pcThresholdList[41]) * 1) * pca.explained_variance_ratio_[21] +
((df3[pcNameList[22]] >= pcThresholdList[42]) * 1) * pca.explained_variance_ratio_[22] +
((df3[pcNameList[22]] <= pcThresholdList[43]) * 1) * pca.explained_variance_ratio_[22] +
((df3[pcNameList[23]] >= pcThresholdList[44]) * 1) * pca.explained_variance_ratio_[23] +
((df3[pcNameList[23]] <= pcThresholdList[45]) * 1) * pca.explained_variance_ratio_[23] +
((df3[pcNameList[24]] >= pcThresholdList[46]) * 1) * pca.explained_variance_ratio_[24] +
((df3[pcNameList[24]] <= pcThresholdList[47]) * 1) * pca.explained_variance_ratio_[24] +
((df3[pcNameList[25]] >= pcThresholdList[48]) * 1) * pca.explained_variance_ratio_[25] +
((df3[pcNameList[25]] <= pcThresholdList[49]) * 1) * pca.explained_variance_ratio_[25]
)
# Inspect results
df3.head()
# Check if probabilities are calculated properly
max(df3["probThres_pc"])
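The long expression above repeats the same pattern for every PC and can be generated with a loop instead. A minimal self-contained sketch on synthetic data; the names `scores`, `weights`, `top`, and `low` are hypothetical stand-ins for `df3[pcNameList]`, `pca.explained_variance_ratio_`, and the cutoff quantiles:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 5 observations x 4 principal components
rng = np.random.default_rng(0)
scores = pd.DataFrame(rng.normal(size=(5, 4)), columns=["PC_1", "PC_2", "PC_3", "PC_4"])
weights = np.array([0.4, 0.3, 0.2, 0.1])  # explained variance ratios (sum to 1)
top = scores.quantile(0.9)                # upper-tail cutoffs per PC
low = scores.quantile(0.1)                # lower-tail cutoffs per PC

prob = pd.Series(0.0, index=scores.index)
for i, col in enumerate(scores.columns):
    if i < 2:  # as above, the first two PCs use a single upper-tail cutoff
        prob += (scores[col] >= top[col]) * weights[i]
    else:      # remaining PCs vote on both tails
        prob += (scores[col] >= top[col]) * weights[i]
        prob += (scores[col] <= low[col]) * weights[i]

# Since each PC can contribute at most its own weight, the score is bounded by 1
```

The same loop applied to `pcNameList`, `pcThresholdList`, and `pca.explained_variance_ratio_` reproduces the explicit sum while being far easier to maintain.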
The aim is to reduce contamination from other event types in the plots. Based on our test plots, a threshold of around 0.73 to 0.75 is sufficient to make a reasonable guess at the existence of potholes. Note that since there are fewer event types in the filtered data, the colour codes may have changed.
# Set Threshold probability values
pcSumThresholdProb = sum(pca.explained_variance_ratio_[0:9]).round(3)
meetsThresholdProb1 = df3['probThres_pc'] >= pcSumThresholdProb
meetsThresholdProb2 = df3['probThres_pc'] >= 0.8
# Inspect variable
pcSumThresholdProb
# Get Count of observations that meets threshold
print(df3[meetsThresholdProb1].shape[0])
print(df3[meetsThresholdProb2].shape[0])
# Get % count of potential contamination from non-tagged pothole events that met threshold
nonTaggedPH1 = (df3["Event"] != 1) & (meetsThresholdProb1)
nonTaggedPH2 = (df3["Event"] != 1) & (meetsThresholdProb2)
print(round(((df3[nonTaggedPH1].shape[0] / df3[meetsThresholdProb1].shape[0]) * 100), 3))
print(round(((df3[nonTaggedPH2].shape[0] / df3[meetsThresholdProb2].shape[0]) * 100), 3))
# Get a % Count of observations that met threshold
print(round(((df3[meetsThresholdProb1].shape[0] / df3.shape[0]) * 100), 3))
print(round(((df3[meetsThresholdProb2].shape[0] / df3.shape[0]) * 100), 3))
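The contamination and coverage percentages printed above follow the same recipe for each threshold, so they can be wrapped in a small helper. A sketch with hypothetical names (`events` and `meets` stand in for `df3["Event"]` and the threshold masks):

```python
import pandas as pd

def threshold_stats(events: pd.Series, meets: pd.Series):
    """Return (% of flagged rows not tagged as pothole, % of all rows flagged)."""
    n_flagged = int(meets.sum())
    n_contam = int(((events != 1) & meets).sum())
    contamination = round(n_contam / n_flagged * 100, 3) if n_flagged else float("nan")
    coverage = round(n_flagged / len(events) * 100, 3)
    return contamination, coverage

# Tiny example: Event 1 = tagged pothole, 0 = other event types
events = pd.Series([1, 1, 0, 0, 0])
meets = pd.Series([True, False, True, False, False])
contamination, coverage = threshold_stats(events, meets)
# 2 rows flagged, 1 of them not tagged as pothole -> 50.0% contamination, 40.0% coverage
```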
# PCA plot by Event between PC 01 and PC 02
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[1], data = df3[meetsThresholdProb1], hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 02')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 0.5, cutoff_PC02 + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC02,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC02,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 02
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[1], data = df3[meetsThresholdProb2], hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 02')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 0.5, cutoff_PC02 + 1, ("Cutoff Intersect Point: (" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC02,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC02,0) # Draw horizontal cutoff point
Thresholding of Principal Components to Separate Pothole Events. Less aggressive thresholds are used here to filter the principal component scores.
# Define global threshold points
pcCutoff = 0.01
pcCutoff_Normal = 1 - pcCutoff
pcCutoff_TopTail = 1 - (pcCutoff / 2)
pcCutoff_LowTail = (pcCutoff / 2)
# Define cutoff points
cutoff_PC01 = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[0]], pcCutoff_Normal)
cutoff_PC02 = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[1]], pcCutoff_Normal)
cutoff_PC03a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[2]], pcCutoff_TopTail)
cutoff_PC03b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[2]], pcCutoff_LowTail)
cutoff_PC04a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[3]], pcCutoff_TopTail)
cutoff_PC04b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[3]], pcCutoff_LowTail)
cutoff_PC05a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[4]], pcCutoff_TopTail)
cutoff_PC05b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[4]], pcCutoff_LowTail)
cutoff_PC06a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[5]], pcCutoff_TopTail)
cutoff_PC06b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[5]], pcCutoff_LowTail)
cutoff_PC07a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[6]], pcCutoff_TopTail)
cutoff_PC07b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[6]], pcCutoff_LowTail)
cutoff_PC08a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[7]], pcCutoff_TopTail)
cutoff_PC08b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[7]], pcCutoff_LowTail)
cutoff_PC09a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[8]], pcCutoff_TopTail)
cutoff_PC09b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[8]], pcCutoff_LowTail)
cutoff_PC10a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[9]], pcCutoff_TopTail)
cutoff_PC10b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[9]], pcCutoff_LowTail)
cutoff_PC11a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[10]], pcCutoff_TopTail)
cutoff_PC11b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[10]], pcCutoff_LowTail)
cutoff_PC12a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[11]], pcCutoff_TopTail)
cutoff_PC12b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[11]], pcCutoff_LowTail)
cutoff_PC13a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[12]], pcCutoff_TopTail)
cutoff_PC13b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[12]], pcCutoff_LowTail)
cutoff_PC14a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[13]], pcCutoff_TopTail)
cutoff_PC14b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[13]], pcCutoff_LowTail)
cutoff_PC15a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[14]], pcCutoff_TopTail)
cutoff_PC15b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[14]], pcCutoff_LowTail)
cutoff_PC16a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[15]], pcCutoff_TopTail)
cutoff_PC16b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[15]], pcCutoff_LowTail)
cutoff_PC17a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[16]], pcCutoff_TopTail)
cutoff_PC17b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[16]], pcCutoff_LowTail)
cutoff_PC18a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[17]], pcCutoff_TopTail)
cutoff_PC18b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[17]], pcCutoff_LowTail)
cutoff_PC19a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[18]], pcCutoff_TopTail)
cutoff_PC19b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[18]], pcCutoff_LowTail)
cutoff_PC20a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[19]], pcCutoff_TopTail)
cutoff_PC20b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[19]], pcCutoff_LowTail)
cutoff_PC21a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[20]], pcCutoff_TopTail)
cutoff_PC21b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[20]], pcCutoff_LowTail)
cutoff_PC22a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[21]], pcCutoff_TopTail)
cutoff_PC22b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[21]], pcCutoff_LowTail)
cutoff_PC23a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[22]], pcCutoff_TopTail)
cutoff_PC23b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[22]], pcCutoff_LowTail)
cutoff_PC24a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[23]], pcCutoff_TopTail)
cutoff_PC24b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[23]], pcCutoff_LowTail)
cutoff_PC25a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[24]], pcCutoff_TopTail)
cutoff_PC25b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[24]], pcCutoff_LowTail)
cutoff_PC26a = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[25]], pcCutoff_TopTail)
cutoff_PC26b = np.quantile(df3.loc[df3["Event"] != 1][pcNameList[25]], pcCutoff_LowTail)
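Each pair of lines above recomputes the same top and bottom tail quantiles per PC; a helper can generate both cutoffs in one call. A sketch, where `non_pothole` is a hypothetical stand-in for `df3.loc[df3["Event"] != 1][pcNameList[i]]`:

```python
import numpy as np
import pandas as pd

def tail_cutoffs(series: pd.Series, pc_cutoff: float = 0.01):
    """Return (upper-tail, lower-tail) cutoffs, splitting pc_cutoff mass across both tails."""
    return (np.quantile(series, 1 - pc_cutoff / 2),
            np.quantile(series, pc_cutoff / 2))

# Tiny example on a known spread of values
non_pothole = pd.Series(np.arange(101, dtype=float))  # values 0..100
hi, lo = tail_cutoffs(non_pothole, pc_cutoff=0.1)     # 5% in each tail
# On evenly spaced 0..100, the 0.95 and 0.05 quantiles are 95.0 and 5.0
```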
# List of PC thresholds
pcThresholdList = [cutoff_PC01, cutoff_PC02,
cutoff_PC03a, cutoff_PC03b,
cutoff_PC04a, cutoff_PC04b,
cutoff_PC05a, cutoff_PC05b,
cutoff_PC06a, cutoff_PC06b,
cutoff_PC07a, cutoff_PC07b,
cutoff_PC08a, cutoff_PC08b,
cutoff_PC09a, cutoff_PC09b,
cutoff_PC10a, cutoff_PC10b,
cutoff_PC11a, cutoff_PC11b,
cutoff_PC12a, cutoff_PC12b,
cutoff_PC13a, cutoff_PC13b,
cutoff_PC14a, cutoff_PC14b,
cutoff_PC15a, cutoff_PC15b,
cutoff_PC16a, cutoff_PC16b,
cutoff_PC17a, cutoff_PC17b,
cutoff_PC18a, cutoff_PC18b,
cutoff_PC19a, cutoff_PC19b,
cutoff_PC20a, cutoff_PC20b,
cutoff_PC21a, cutoff_PC21b,
cutoff_PC22a, cutoff_PC22b,
cutoff_PC23a, cutoff_PC23b,
cutoff_PC24a, cutoff_PC24b,
cutoff_PC25a, cutoff_PC25b,
cutoff_PC26a, cutoff_PC26b
]
# inspect list
pcThresholdList[0:5]
# Define thresholding code
df3["probThres_pc2"] = (
((df3[pcNameList[0]] >= pcThresholdList[0]) * 1) * pca.explained_variance_ratio_[0] +
((df3[pcNameList[1]] >= pcThresholdList[1]) * 1) * pca.explained_variance_ratio_[1] +
((df3[pcNameList[2]] >= pcThresholdList[2]) * 1) * pca.explained_variance_ratio_[2] +
((df3[pcNameList[2]] <= pcThresholdList[3]) * 1) * pca.explained_variance_ratio_[2] +
((df3[pcNameList[3]] >= pcThresholdList[4]) * 1) * pca.explained_variance_ratio_[3] +
((df3[pcNameList[3]] <= pcThresholdList[5]) * 1) * pca.explained_variance_ratio_[3] +
((df3[pcNameList[4]] >= pcThresholdList[6]) * 1) * pca.explained_variance_ratio_[4] +
((df3[pcNameList[4]] <= pcThresholdList[7]) * 1) * pca.explained_variance_ratio_[4] +
((df3[pcNameList[5]] >= pcThresholdList[8]) * 1) * pca.explained_variance_ratio_[5] +
((df3[pcNameList[5]] <= pcThresholdList[9]) * 1) * pca.explained_variance_ratio_[5] +
((df3[pcNameList[6]] >= pcThresholdList[10]) * 1) * pca.explained_variance_ratio_[6] +
((df3[pcNameList[6]] <= pcThresholdList[11]) * 1) * pca.explained_variance_ratio_[6] +
((df3[pcNameList[7]] >= pcThresholdList[12]) * 1) * pca.explained_variance_ratio_[7] +
((df3[pcNameList[7]] <= pcThresholdList[13]) * 1) * pca.explained_variance_ratio_[7] +
((df3[pcNameList[8]] >= pcThresholdList[14]) * 1) * pca.explained_variance_ratio_[8] +
((df3[pcNameList[8]] <= pcThresholdList[15]) * 1) * pca.explained_variance_ratio_[8] +
((df3[pcNameList[9]] >= pcThresholdList[16]) * 1) * pca.explained_variance_ratio_[9] +
((df3[pcNameList[9]] <= pcThresholdList[17]) * 1) * pca.explained_variance_ratio_[9] +
((df3[pcNameList[10]] >= pcThresholdList[18]) * 1) * pca.explained_variance_ratio_[10] +
((df3[pcNameList[10]] <= pcThresholdList[19]) * 1) * pca.explained_variance_ratio_[10] +
((df3[pcNameList[11]] >= pcThresholdList[20]) * 1) * pca.explained_variance_ratio_[11] +
((df3[pcNameList[11]] <= pcThresholdList[21]) * 1) * pca.explained_variance_ratio_[11] +
((df3[pcNameList[12]] >= pcThresholdList[22]) * 1) * pca.explained_variance_ratio_[12] +
((df3[pcNameList[12]] <= pcThresholdList[23]) * 1) * pca.explained_variance_ratio_[12] +
((df3[pcNameList[13]] >= pcThresholdList[24]) * 1) * pca.explained_variance_ratio_[13] +
((df3[pcNameList[13]] <= pcThresholdList[25]) * 1) * pca.explained_variance_ratio_[13] +
((df3[pcNameList[14]] >= pcThresholdList[26]) * 1) * pca.explained_variance_ratio_[14] +
((df3[pcNameList[14]] <= pcThresholdList[27]) * 1) * pca.explained_variance_ratio_[14] +
((df3[pcNameList[15]] >= pcThresholdList[28]) * 1) * pca.explained_variance_ratio_[15] +
((df3[pcNameList[15]] <= pcThresholdList[29]) * 1) * pca.explained_variance_ratio_[15] +
((df3[pcNameList[16]] >= pcThresholdList[30]) * 1) * pca.explained_variance_ratio_[16] +
((df3[pcNameList[16]] <= pcThresholdList[31]) * 1) * pca.explained_variance_ratio_[16] +
((df3[pcNameList[17]] >= pcThresholdList[32]) * 1) * pca.explained_variance_ratio_[17] +
((df3[pcNameList[17]] <= pcThresholdList[33]) * 1) * pca.explained_variance_ratio_[17] +
((df3[pcNameList[18]] >= pcThresholdList[34]) * 1) * pca.explained_variance_ratio_[18] +
((df3[pcNameList[18]] <= pcThresholdList[35]) * 1) * pca.explained_variance_ratio_[18] +
((df3[pcNameList[19]] >= pcThresholdList[36]) * 1) * pca.explained_variance_ratio_[19] +
((df3[pcNameList[19]] <= pcThresholdList[37]) * 1) * pca.explained_variance_ratio_[19] +
((df3[pcNameList[20]] >= pcThresholdList[38]) * 1) * pca.explained_variance_ratio_[20] +
((df3[pcNameList[20]] <= pcThresholdList[39]) * 1) * pca.explained_variance_ratio_[20] +
((df3[pcNameList[21]] >= pcThresholdList[40]) * 1) * pca.explained_variance_ratio_[21] +
((df3[pcNameList[21]] <= pcThresholdList[41]) * 1) * pca.explained_variance_ratio_[21] +
((df3[pcNameList[22]] >= pcThresholdList[42]) * 1) * pca.explained_variance_ratio_[22] +
((df3[pcNameList[22]] <= pcThresholdList[43]) * 1) * pca.explained_variance_ratio_[22] +
((df3[pcNameList[23]] >= pcThresholdList[44]) * 1) * pca.explained_variance_ratio_[23] +
((df3[pcNameList[23]] <= pcThresholdList[45]) * 1) * pca.explained_variance_ratio_[23] +
((df3[pcNameList[24]] >= pcThresholdList[46]) * 1) * pca.explained_variance_ratio_[24] +
((df3[pcNameList[24]] <= pcThresholdList[47]) * 1) * pca.explained_variance_ratio_[24] +
((df3[pcNameList[25]] >= pcThresholdList[48]) * 1) * pca.explained_variance_ratio_[25] +
((df3[pcNameList[25]] <= pcThresholdList[49]) * 1) * pca.explained_variance_ratio_[25]
)
# Inspect results
df3.head()
# Set Threshold probability values
pcSumThresholdProb = sum(pca.explained_variance_ratio_[0:9]).round(3)
meetsThresholdProb1 = df3['probThres_pc2'] >= pcSumThresholdProb
meetsThresholdProb2 = df3['probThres_pc2'] >= 0.8
# Inspect variable
pcSumThresholdProb
# Get Count of observations that meets threshold
print(df3[meetsThresholdProb1].shape[0])
print(df3[meetsThresholdProb2].shape[0])
# Get % count of potential contamination from non-tagged pothole events that met threshold
nonTaggedPH1 = (df3["Event"] != 1) & (meetsThresholdProb1)
nonTaggedPH2 = (df3["Event"] != 1) & (meetsThresholdProb2)
print(round(((df3[nonTaggedPH1].shape[0] / df3[meetsThresholdProb1].shape[0]) * 100), 3))
print(round(((df3[nonTaggedPH2].shape[0] / df3[meetsThresholdProb2].shape[0]) * 100), 3))
# Get a % Count of observations that met threshold
print(round(((df3[meetsThresholdProb1].shape[0] / df3.shape[0]) * 100), 3))
print(round(((df3[meetsThresholdProb2].shape[0] / df3.shape[0]) * 100), 3))
# PCA plot by Event between PC 01 and PC 02
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[1], data = df3[meetsThresholdProb1], hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 02')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 0.5, cutoff_PC02 + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC02,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC02,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 02
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (13,13))
sns.scatterplot(x = pcNameList[0], y = pcNameList[1], data = df3[meetsThresholdProb2], hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 02')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(cutoff_PC01 + 0.5, cutoff_PC02 + 1, ("Cutoff Intersect Point:\n(" + str(round(cutoff_PC01,1)) + ", " + str (round(cutoff_PC02,1)) + ")"))
# Draw cut off points
plt.axvline(cutoff_PC01,0) # Draw vertical cutoff point
plt.axhline(cutoff_PC02,0) # Draw horizontal cutoff point
This checkpoint saves time, since reprocessing the data to reach this stage takes very long.
# Initialise save / load conditions
xlsfile = "../MainData/pothole_records_20200316_bike/AndroSensor_Consolidated/processedPCData_A4.xlsx"
ws = 'Data'
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# save dataframe as excel file
df3.to_excel(xlsfile, sheet_name=ws)
# Print Execution time
print("%s" % (time.time() - start_time))
# Load Data
df3 = pd.read_excel(io = xlsfile, sheet_name = ws)
# Inspect Data
df3.head()
# Drop additional index column created
df3 = df3.loc[:, "SN": "probThres_pc2"]
# Inspect Data
df3.shape
# Inspect Data
df3.head()
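Excel round-trips are slow for frames this size and add a spare index column on reload. If the checkpoint is only ever read back by pandas, a binary format such as pickle is considerably faster. A sketch with a hypothetical temporary path:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"SN": [1, 2], "probThres_pc2": [0.1, 0.9]})

with tempfile.TemporaryDirectory() as tmp:
    pklfile = os.path.join(tmp, "processedPCData_checkpoint.pkl")  # hypothetical path
    df.to_pickle(pklfile)                # save checkpoint
    df_loaded = pd.read_pickle(pklfile)  # index round-trips; no extra column to drop
```

Parquet (`df.to_parquet`) is another option when cross-tool compatibility matters.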
There is considerable overlap between non-pothole and pothole events; hence, retagging may be needed.
# Set filters
supervisedFilter0 = df3['Event'] != 1
supervisedFilter1 = df3['Event'] == 1
# PCA plot by Event between PC 01 and PC 10
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (25,10))
sns.scatterplot(x = "PC_1", y = "PC_10", data = df3[supervisedFilter0], hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 10 (Non-Pothole Cases Plotted)')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(25 + 1, 20 + 1, ("Cutoff Intersect Point:\n(" + str(25) + ", " + str (20) + ")"))
# Draw cut off points
plt.axvline(25,0) # Draw vertical cutoff point
plt.axhline(20,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 10
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (25,10))
sns.scatterplot(x = "PC_1", y = "PC_10", data = df3[supervisedFilter1], hue='Event', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 10 (Pothole Cases Plotted)')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(25 + 1, 20 + 1, ("Cutoff Intersect Point:\n(" + str(25) + ", " + str (20) + ")"))
# Draw cut off points
plt.axvline(25,0) # Draw vertical cutoff point
plt.axhline(20,0) # Draw horizontal cutoff point
Preliminary results do not look very good, as there is poor separation between the event classes.
# Define a selection PC for clustering
pcSelection = df3.loc[:, "PC_1":"PC_10"]
pcSelection.head()
# fit the model for outlier detection (default)
clf = LocalOutlierFactor(n_neighbors=100, contamination=0.1)
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# use fit_predict to compute the predicted labels of the training samples
# (when LOF is used for outlier detection, the estimator has no predict,
# decision_function and score_samples methods).
y_pred = clf.fit_predict(pcSelection) # predicted outcomes of outliers (-1) and non-outliers (1)
X_scores = clf.negative_outlier_factor_ # predicted outlier scores: the more negative the value, the more likely the observation is an outlier
# Print Execution time
print("%s" % (time.time() - start_time))
# Clean up outputs and append data
df3["LOF_pred"] = pd.Series(y_pred).replace([-1,1],[1,0]) # reclassed: 1 = outlier; 0 = non-outlier
df3["LOF_score"] = pd.Series(1 - (X_scores - min(X_scores)) / (max(X_scores) - min(X_scores))) # rescaled using min-max scaling; the closer the value is to 1, the more likely it is an outlier
# inspect data
df3.head()
print(df3.shape[0])
print(len(df3[(df3["LOF_pred"] == 0)]))
print(len(df3[(df3["LOF_pred"] == 1)]))
print(len(df3[(df3["LOF_pred"] == 0)]) + len(df3[(df3["LOF_pred"] == 1)]))
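The min-max rescaling applied to `negative_outlier_factor_` above can be checked in isolation. A self-contained sketch with a hypothetical score array standing in for `clf.negative_outlier_factor_`:

```python
import numpy as np

# Hypothetical LOF scores: more negative = more outlier-like
X_scores = np.array([-1.0, -1.2, -5.0, -1.1])

# Min-max rescale, then flip so that values near 1 mark likely outliers
rescaled = 1 - (X_scores - X_scores.min()) / (X_scores.max() - X_scores.min())
# The most negative raw score maps to 1.0; the least negative maps to 0.0
```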
# Set filters
supervisedFilter0 = df3['Event'] != 1
supervisedFilter1 = df3['Event'] == 1
# PCA plot by Event between PC 01 and PC 02
# Colour coded by event labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (25,10))
sns.scatterplot(x = "PC_1", y = "PC_2", data = df3[supervisedFilter0], hue='LOF_pred', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 02 (Non-Pothole Cases Plotted)')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(25 + 1, 20 + 1, ("Cutoff Intersect Point:\n(" + str(25) + ", " + str (20) + ")"))
# Draw cut off points
plt.axvline(25,0) # Draw vertical cutoff point
plt.axhline(20,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 02
# Colour coded by LOF outlier predictions
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (25,10))
sns.scatterplot(x = "PC_1", y = "PC_2", data = df3[supervisedFilter1], hue='LOF_pred', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 02 (Pothole Cases Plotted)')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(25 + 1, 20 + 1, ("Cutoff Intersect Point:\n(" + str(25) + ", " + str (20) + ")"))
# Draw cut off points
plt.axvline(25,0) # Draw vertical cutoff point
plt.axhline(20,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 10
# Colour coded by LOF outlier predictions
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (25,10))
sns.scatterplot(x = "PC_1", y = "PC_10", data = df3[supervisedFilter0], hue='LOF_pred', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 10 (Non-Pothole Cases Plotted)')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(25 + 1, 20 + 1, ("Cutoff Intersect Point:\n(" + str(25) + ", " + str (20) + ")"))
# Draw cut off points
plt.axvline(25,0) # Draw vertical cutoff point
plt.axhline(20,0) # Draw horizontal cutoff point
This step is to save time, since it takes a very long time to process the data to get to this stage
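Since this save/load checkpoint pattern repeats several times below, it could be factored into a pair of helpers (a sketch of my own; the function names are mine, and writing `.xlsx` files requires `openpyxl`):

```python
import time
import pandas as pd

def checkpoint_save(df, path, sheet="Data"):
    """Save a checkpoint to Excel and report how long it took."""
    t0 = time.time()
    df.to_excel(path, sheet_name=sheet)
    print("saved in %.1fs" % (time.time() - t0))

def checkpoint_load(path, sheet="Data"):
    """Reload a checkpoint; the caller still trims the extra index column."""
    t0 = time.time()
    df = pd.read_excel(io=path, sheet_name=sheet)
    print("loaded in %.1fs" % (time.time() - t0))
    return df
```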
# Initialise save / load conditions
xlsfile = "../MainData/pothole_records_20200316_bike/AndroSensor_Consolidated/processedPCData_A5b.xlsx" # a file variant with the LOF cluster data is loaded instead
ws = 'Data'
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# save dataframe as excel file
df3.to_excel(xlsfile, sheet_name=ws)
# Print Execution time
print("%s" % (time.time() - start_time))
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# Load Data
df3 = pd.read_excel(io = xlsfile, sheet_name = ws)
# Inspect Data
df3.head()
# Print Execution time
print("%s" % (time.time() - start_time))
# Drop additional index column created
df3 = df3.loc[:, "SN": "LOF_score"]
# Inspect Data
df3.shape
# Inspect Data
df3.head()
# Inspect Data
df3.head()
# Inspect Selection
cluster_data = df3.loc[::, "PC_1": "PC_10"]
cluster_data
Maximum number of clusters generated = 4
Speculation: this could be due to the lack of samples of other events such as speed stripes, humps and stationary periods.
Number of clusters generated: 15+1
WARNING: This takes around 20min to run
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# Perform HDBSCAN on selected data
clusterer = hdbscan.HDBSCAN(min_cluster_size=1000, gen_min_span_tree=True)
clusterer.fit(cluster_data)
# Print Execution time
print("%s" % (time.time() - start_time))
# Extract clusters
plt.figure(figsize = (25,10)) # Control figure size
clusterer.condensed_tree_.plot(select_clusters=True, selection_palette=sns.color_palette())
clusterLabels = clusterer.labels_
clusterLabels_Prob = clusterer.probabilities_
print(max(clusterLabels))
print(min(clusterLabels))
Clustering is not giving very good results, misclassifying a fair bit of the data as outliers.
Potential workaround:
# Append cluster data to main dataset
df3["HDBSCAN_label"] = clusterLabels
df3["HDBSCAN_prob"] = clusterLabels_Prob
# Inspect data
df3.head()
# Set filters
supervisedFilter10 = df3['HDBSCAN_label'] != -1
supervisedFilter11 = df3['HDBSCAN_label'] == -1
# PCA plot between PC 01 and PC 10
# Colour coded by HDBSCAN cluster labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (25,10))
sns.scatterplot(x = "PC_1", y = "PC_10", data = df3[supervisedFilter10], hue='HDBSCAN_label', alpha=0.3, palette = "muted").set_title('PCA plot between PC 01 and PC 10 (HDBSCAN Clustered Points)')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(25 + 1, 20 + 1, ("Cutoff Intersect Point:\n(" + str(25) + ", " + str (20) + ")"))
# Draw cut off points
plt.axvline(25,0) # Draw vertical cutoff point
plt.axhline(20,0) # Draw horizontal cutoff point
# PCA plot between PC 01 and PC 10
# Colour coded by HDBSCAN cluster labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (25,10))
sns.scatterplot(x = "PC_1", y = "PC_10", data = df3[supervisedFilter11], hue='HDBSCAN_label', alpha=0.3, palette = "muted").set_title('PCA plot between PC 01 and PC 10 (HDBSCAN Noise Points)')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(25 + 1, 20 + 1, ("Cutoff Intersect Point:\n(" + str(25) + ", " + str (20) + ")"))
# Draw cut off points
plt.axvline(25,0) # Draw vertical cutoff point
plt.axhline(20,0) # Draw horizontal cutoff point
This step is to save time, since it takes a very long time to process the data to get to this stage
# Initialise save / load conditions
xlsfile = "../MainData/pothole_records_20200316_bike/AndroSensor_Consolidated/processedPCData_A5c.xlsx" # a file variant with the LOF cluster data is loaded instead
ws = 'Data'
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# save dataframe as excel file
df3.to_excel(xlsfile, sheet_name=ws)
# Print Execution time
print("%s" % (time.time() - start_time))
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# Load Data
df3 = pd.read_excel(io = xlsfile, sheet_name = ws)
# Print Execution time
print("%s" % (time.time() - start_time))
# Inspect Data
df3.head()
# Drop additional index column created
df3 = df3.loc[:, "SN": "HDBSCAN_prob"]
# Inspect Data
df3.shape
# Inspect Data
df3.head()
# Inspect Selection
cluster_data = df3.loc[::, "PC_1": "PC_10"]
cluster_data
# Finding optimum number of clusters using the elbow plot
Sum_of_squared_distances = []
K = range(1,10)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(cluster_data)
    Sum_of_squared_distances.append(km.inertia_)
# Plot elbow plot
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
# Generate Silhouette plots
# Ideally the values should be positive and tending towards 1 for a good split
# This step takes VERY long if your dataset is large
kmeans = KMeans(n_clusters = 5, random_state = 20200504, max_iter = 500)
kmeans_preds = kmeans.fit_predict(cluster_data)
score = silhouette_score(cluster_data, kmeans_preds)
score
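The single silhouette score above only evaluates k = 5; a hedged extension (synthetic blobs stand in for `cluster_data` so the sketch is self-contained) scans a range of k and keeps the best-scoring one:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for cluster_data
X, _ = make_blobs(n_samples=500, centers=4, random_state=20200504)
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=20200504, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```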
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# Perform k means clustering
kmeans = KMeans(n_clusters = 5, random_state = 20200504, max_iter = 500)
kmeans.fit(cluster_data)
# Print Execution time
print("%s" % (time.time() - start_time))
kmeans_labels = kmeans.labels_
print(min(kmeans_labels))
print(max(kmeans_labels))
# Append cluster data to main dataset
df3["KMeans_label"] = kmeans_labels
# Inspect data
df3.head()
# Set filters
supervisedFilter0 = df3['Event'] != 1
supervisedFilter1 = df3['Event'] == 1
# PCA plot by Event between PC 01 and PC 10
# Colour coded by KMeans cluster labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (25,10))
sns.scatterplot(x = "PC_1", y = "PC_10", data = df3[supervisedFilter0], hue='KMeans_label', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 10 (Non-Pothole Cases Plotted)')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(25 + 1, 20 + 1, ("Cutoff Intersect Point:\n(" + str(25) + ", " + str (20) + ")"))
# Draw cut off points
plt.axvline(25,0) # Draw vertical cutoff point
plt.axhline(20,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 10
# Colour coded by KMeans cluster labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (25,10))
sns.scatterplot(x = "PC_1", y = "PC_10", data = df3[supervisedFilter1], hue='KMeans_label', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 10 (Pothole Cases Plotted)')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(25 + 1, 20 + 1, ("Cutoff Intersect Point:\n(" + str(25) + ", " + str (20) + ")"))
# Draw cut off points
plt.axvline(25,0) # Draw vertical cutoff point
plt.axhline(20,0) # Draw horizontal cutoff point
Results of the clustering are consistent. Validation passed.
train_data, test_data = train_test_split(df3, test_size=0.3, random_state=20200504)
print(train_data.shape)
print(test_data.shape)
# Train Data
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# Perform k means clustering
kmeans_train = KMeans(n_clusters = 5, random_state = 20200504, max_iter = 500)
kmeans_train.fit(train_data.loc[::, "PC_1": "PC_10"])
# Print Execution time
print("%s" % (time.time() - start_time))
# Test Data
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# Perform k means clustering
kmeans_test = KMeans(n_clusters = 5, random_state = 20200504, max_iter = 500)
kmeans_test.fit(test_data.loc[::, "PC_1": "PC_10"])
# Print Execution time
print("%s" % (time.time() - start_time))
# Train Data
kmeans_labels_train = kmeans_train.labels_
print("Number of clusters from train data:")
print(min(kmeans_labels_train))
print(max(kmeans_labels_train))
# Test Data
kmeans_labels_test = kmeans_test.labels_
print("Number of clusters from test data:")
print(min(kmeans_labels_test))
print(max(kmeans_labels_test))
# Append cluster data to main dataset
train_data["KMeans_label_train"] = kmeans_labels_train
# Inspect data
train_data.head()
# Append cluster data to main dataset
test_data["KMeans_label_test"] = kmeans_labels_test
# Inspect data
test_data.head()
# PCA plot by Event between PC 01 and PC 10
# Colour coded by KMeans cluster labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (25,10))
sns.scatterplot(x = "PC_1", y = "PC_10", data = train_data, hue='KMeans_label_train', alpha=0.3, palette = "muted").set_title('PCA plot between PC 01 and PC 10 (Train Split)')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(25 + 1, 20 + 1, ("Cutoff Intersect Point:\n(" + str(25) + ", " + str (20) + ")"))
# Draw cut off points
plt.axvline(25,0) # Draw vertical cutoff point
plt.axhline(20,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 10
# Colour coded by KMeans cluster labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (25,10))
sns.scatterplot(x = "PC_1", y = "PC_10", data = test_data, hue='KMeans_label_test', alpha=0.3, palette = "muted").set_title('PCA plot between PC 01 and PC 10 (Test Split)')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(25 + 1, 20 + 1, ("Cutoff Intersect Point:\n(" + str(25) + ", " + str (20) + ")"))
# Draw cut off points
plt.axvline(25,0) # Draw vertical cutoff point
plt.axhline(20,0) # Draw horizontal cutoff point
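One caveat on comparing the train and test clusterings visually: KMeans label IDs are arbitrary (cluster 0 in one fit may be cluster 3 in the other), so raw label agreement understates consistency. The adjusted Rand index compares the partitions themselves; a minimal sketch (an addition of mine, not part of the original analysis):

```python
from sklearn.metrics import adjusted_rand_score

a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 1]  # identical partition, permuted label IDs
print(adjusted_rand_score(a, b))  # → 1.0 (perfect agreement despite different IDs)
```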
# Remove unwanted variables
del train_data
del test_data
del kmeans_labels_train
del kmeans_labels_test
This step is to save time, since it takes a very long time to process the data to get to this stage
# Initialise save / load conditions
xlsfile = "../MainData/pothole_records_20200316_bike/AndroSensor_Consolidated/processedPCData_A5d.xlsx" # a file variant with the LOF cluster data is loaded instead
ws = 'Data'
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# save dataframe as excel file
df3.to_excel(xlsfile, sheet_name=ws)
# Print Execution time
print("%s" % (time.time() - start_time))
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# Load Data
df3 = pd.read_excel(io = xlsfile, sheet_name = ws)
# Print Execution time
print("%s" % (time.time() - start_time))
# Inspect Data
df3.head()
# Drop additional index column created
df3 = df3.loc[:, "SN": "KMeans_label"]
# Inspect Data
df3.shape
# Inspect Data
df3.head()
# Inspect Selection
cluster_data = df3.loc[::, "PC_1": "PC_10"]
cluster_data
# Find ideal epsilon value
neigh = NearestNeighbors(n_neighbors=10)
nbrs = neigh.fit(cluster_data)
distances, indices = nbrs.kneighbors(cluster_data)
# plot elbow plot to find inflexion point
plt.figure(figsize = (10,10))
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.plot(distances)
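Reading the inflexion point off the plot is subjective; a rough programmatic heuristic (my own sketch, not from the original workflow) takes the point of maximum second difference on the sorted k-distance curve as the knee:

```python
import numpy as np

# Synthetic sorted k-distances: a dense bulk plus a sparse tail of outliers
rng = np.random.RandomState(0)
d = np.sort(np.concatenate([rng.uniform(0.1, 0.5, 950), rng.uniform(2.0, 5.0, 50)]))
knee_idx = np.argmax(np.diff(d, 2)) + 1  # index of the sharpest upward bend
eps_guess = d[knee_idx]
print(round(eps_guess, 3))  # lands near the top of the dense bulk (~0.5 here)
```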
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
#DBSCAN
DBSCAN_clustering = DBSCAN(eps=2, min_samples=100).fit(cluster_data)
DBSCAN_label = DBSCAN_clustering.labels_
# Print Execution time
print("%s" % (time.time() - start_time))
max(DBSCAN_label)+2 # number of groups: max+1 clusters plus the noise label (-1)
# Append cluster data to main dataset
df3["DBSCAN_label"] = DBSCAN_label
# Inspect data
df3.head()
# Set filters
supervisedFilter0 = df3['Event'] != 1
supervisedFilter1 = df3['Event'] == 1
# PCA plot by Event between PC 01 and PC 10
# Colour coded by DBSCAN cluster labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (25,10))
sns.scatterplot(x = "PC_1", y = "PC_10", data = df3[supervisedFilter0], hue='DBSCAN_label', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 10 (Non-Pothole Cases Plotted)')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(25 + 1, 20 + 1, ("Cutoff Intersect Point:\n(" + str(25) + ", " + str (20) + ")"))
# Draw cut off points
plt.axvline(25,0) # Draw vertical cutoff point
plt.axhline(20,0) # Draw horizontal cutoff point
# PCA plot by Event between PC 01 and PC 10
# Colour coded by DBSCAN cluster labels
# Use the 'hue' argument to provide a factor variable
plt.figure(figsize = (25,10))
sns.scatterplot(x = "PC_1", y = "PC_10", data = df3[supervisedFilter1], hue='DBSCAN_label', alpha=0.3, palette = "muted").set_title('PCA plot by Event between PC 01 and PC 10 (Pothole Cases Plotted)')
# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
# Define text of intersect point(s)
plt.text(25 + 1, 20 + 1, ("Cutoff Intersect Point:\n(" + str(25) + ", " + str (20) + ")"))
# Draw cut off points
plt.axvline(25,0) # Draw vertical cutoff point
plt.axhline(20,0) # Draw horizontal cutoff point
This step is to save time, since it takes a very long time to process the data to get to this stage
# Initialise save / load conditions
xlsfile = "../MainData/pothole_records_20200316_bike/AndroSensor_Consolidated/processedPCData_A5e.xlsx" # a file variant with the LOF cluster data is loaded instead
ws = 'Data'
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# save dataframe as excel file
df3.to_excel(xlsfile, sheet_name=ws)
# Print Execution time
print("%s" % (time.time() - start_time))
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# Load Data
df3 = pd.read_excel(io = xlsfile, sheet_name = ws)
# Print Execution time
print("%s" % (time.time() - start_time))
# Inspect Data
df3.head()
# Drop additional index column created
df3 = df3.loc[:, "SN": "DBSCAN_label"]
# Inspect Data
df3.shape
# Inspect Data
df3.head()
So far, K-means clustering and the thresholding methods produce the best results, so we will use them as an ensemble voting system to identify potholes. The 3 methods have been sorted in descending order of filter aggressiveness.
For the thresholding methods, we set the cutoff weighted probability for a potential pothole at 70% (80% for the stricter second variant). The minimum combined votes needed to be considered a pothole is taken as between 2 and 3. These results are then pushed to an HDBSCAN + DBSCAN algorithm to cluster them on a geospatial map.
We will apply a voting method over the best 3 methods to determine the possibility of a pothole.
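A toy illustration of the 2-of-3 voting rule before applying it to the full dataset (the column names mirror those used below, but the values here are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "probThres_pc":  [0.90, 0.50, 0.80],
    "probThres_pc2": [0.85, 0.20, 0.60],
    "KMeans_label":  [3, 1, 3],
})
# One vote per method: threshold 1 >= 0.7, threshold 2 >= 0.8, key KMeans cluster == 3
toy["Vote"] = ((toy["probThres_pc"] >= 0.7).astype(int)
               + (toy["probThres_pc2"] >= 0.8).astype(int)
               + (toy["KMeans_label"] == 3).astype(int))
print(toy["Vote"].tolist())  # → [3, 0, 2]
```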
# Define cutoff points
thresholdCutoffProb = 0.7
thresholdCutoffProb2 = 0.8
keyCluster_kmeans = 3
# Voting formula
df3["Vote"] = ((df3["probThres_pc"] >= thresholdCutoffProb) * 1) + ((df3["probThres_pc2"] >= thresholdCutoffProb2) * 1) + ((df3["KMeans_label"] == keyCluster_kmeans) * 1)
# Inspect results
df3.head()
# filter observations by votes
voteFilter = df3['Vote'] >= 2
# Inspect data
print(df3[voteFilter].shape[0]) # total count
print(str(round(df3[voteFilter].shape[0] / df3.shape[0] * 100, 3))+"%") # Total % of data
# filter observations by votes
voteFilter2 = df3['Vote'] >= 3
# Inspect data
print(df3[voteFilter2].shape[0]) # total count
print(str(round(df3[voteFilter2].shape[0] / df3.shape[0] * 100, 3))+"%") # Total % of data
This would serve as the crowdsensing module. Run this as one continuous block.
# Convert GPS Coordinates to radians
df3["Latitude_rad"] = np.radians(df3["LOCATION Latitude : "])
df3["Longitude_rad"] = np.radians(df3["LOCATION Longitude : "])
# Inspect results
df3.head()
# Find median GPS Accuracy of observations with 2 or more votes (in m)
GPS_accuracy = df3[df3["Vote"] >= 2].loc[::, "LOCATION Accuracy ( m)"].quantile(.5)
earth_radius = 6378 # earth's radius at the equator in km (SG very near to the equator so I'll peg my results to it)
epsilon = GPS_accuracy / 1000 / earth_radius # convert the median GPS accuracy (m) to an epsilon threshold in radians for the haversine metric
min_cluster_size = 5
# Define selection
df_geo1 = df3[df3["Vote"] >= 2]
df_geo1 = df_geo1[["SN", "LOCATION Latitude : ", "LOCATION Longitude : ", "Latitude_rad", "Longitude_rad", "Vote"]]
# Inspect results
df_geo1
# Define selection
df_geo2 = df3[df3["Vote"] == 3]
df_geo2 = df_geo2[["SN", "LOCATION Latitude : ", "LOCATION Longitude : ", "Latitude_rad", "Longitude_rad", "Vote"]]
# Inspect results
df_geo2
# Apply clustering
clusterer1 = hdbscan.HDBSCAN(min_cluster_size = min_cluster_size, metric='haversine', cluster_selection_epsilon=epsilon, cluster_selection_method = 'eom')
predictions1 = clusterer1.fit_predict(df_geo1.loc[::, "Latitude_rad":"Longitude_rad"])
# Number of groups: clusters plus the noise label (-1)
print(max(predictions1) + 2)
# Append results to dataframe
df_geo1["Cluster1"] = predictions1
# Inspect results
df_geo1
# Apply clustering
clusterer2 = hdbscan.HDBSCAN(min_cluster_size = min_cluster_size, metric='haversine', cluster_selection_epsilon=epsilon, cluster_selection_method = 'eom')
predictions2 = clusterer2.fit_predict(df_geo2.loc[::, "Latitude_rad":"Longitude_rad"])
# Number of groups: clusters plus the noise label (-1)
print(max(predictions2) + 2)
# Append results to dataframe
df_geo2["Cluster2"] = predictions2
# Inspect results
df_geo2
# Merge data frame
df_geo1 = pd.merge(df_geo1, df_geo2[['SN', 'Cluster2']], on='SN', how='left')
# Inspect results
df_geo1
# Delete redundant variable
del df_geo2
# Reformat latitude and longitude info
df_geo1['point'] = df_geo1.apply(lambda row: Point(latitude=row["LOCATION Latitude : "], longitude=row["LOCATION Longitude : "]), axis=1)
# inspect data
df_geo1.head()
# List of ground-truth coordinates
GOT_latitude = [1.32206,
1.32206,
1.32208,
1.32209,
1.3224,
1.3224,
1.3224,
1.3224,
1.32241,
1.3229,
1.32294,
1.32297,
1.32301,
1.32307,
1.32329,
1.32582,
1.3263,
1.32631,
1.32632,
1.34958,
1.34972,
1.34988,
1.35734,
1.37995,
1.40562,
1.4135,
1.41352,
1.41478,
1.41498,
1.41514,
1.41523,
1.41539,
1.4157
]
GOT_longitude = [103.67308,
103.67311,
103.67309,
103.67338,
103.67826,
103.67821,
103.67823,
103.67825,
103.67855,
103.65484,
103.65459,
103.65464,
103.65486,
103.65462,
103.6546,
103.6564,
103.65608,
103.65637,
103.65608,
103.70386,
103.70384,
103.70375,
103.69518,
103.73146,
103.82144,
103.81522,
103.79288,
103.80462,
103.80455,
103.80405,
103.80384,
103.80391,
103.80059,
]
# Generate Points as a List
GOT_coordinates = [Point(latitude = lat, longitude = lon) for lat, lon in zip(GOT_latitude, GOT_longitude)]
# Append results as a dataframe (Extra step for convenience)
df_GOT = pd.DataFrame({
"latitude": GOT_latitude,
"longitude": GOT_longitude,
"point": GOT_coordinates
})
# Inspect results
df_GOT.head()
len(GOT_coordinates)
suspected_potholes = df_geo1["point"].tolist()
len(suspected_potholes)
def more_than_zero(number):
    return number > 0
# Evaluate suspected points against estimated ground truth
# Threshold Distance 1
threshold_dist = 15 # meters
thresholdCount_list = []
for suspect in suspected_potholes:
    thresholdCounter = 0
    for truth in GOT_coordinates:
        if ((distance.distance(truth, suspect).m) <= threshold_dist):
            thresholdCounter = thresholdCounter + 1
    thresholdCount_list.append(thresholdCounter)
print("Total Count:")
print(len(thresholdCount_list))
print("Total Points on Target (Cluster Set 1):")
print(len(list(filter(more_than_zero, thresholdCount_list))))
print("Percentage of Points on Target (Cluster Set 1):")
print(
str(
round(len(list(filter(more_than_zero, thresholdCount_list))) / len(thresholdCount_list) * 100,3)
) + "%"
)
# Evaluate suspected points against estimated ground truth
# Threshold Distance 2
threshold_dist2 = 10 # meters
thresholdCount_list2 = []
for suspect in suspected_potholes:
    thresholdCounter = 0
    for truth in GOT_coordinates:
        if ((distance.distance(truth, suspect).m) <= threshold_dist2):
            thresholdCounter = thresholdCounter + 1
    thresholdCount_list2.append(thresholdCounter)
print("Total Count:")
print(len(thresholdCount_list2))
print("Total Points on Target (Cluster Set 1):")
print(len(list(filter(more_than_zero, thresholdCount_list2))))
print("Percentage of Points on Target (Cluster Set 1):")
print(
str(
round(len(list(filter(more_than_zero, thresholdCount_list2))) / len(thresholdCount_list2) * 100,3)
) + "%"
)
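The nested loops above scale as O(suspects × truths); a hedged alternative using scikit-learn's `BallTree` with the haversine metric does the radius counting in one vectorized query (coordinates must be in radians, and the radius threshold in metres is divided by Earth's radius in metres). The coordinates here are illustrative:

```python
import numpy as np
from sklearn.neighbors import BallTree

truth = np.radians([[1.32206, 103.67308], [1.41570, 103.80059]])
suspects = np.radians([[1.32207, 103.67309], [1.35000, 103.70000]])

tree = BallTree(truth, metric="haversine")
earth_radius_m = 6371000
counts = tree.query_radius(suspects, r=15 / earth_radius_m, count_only=True)
print(counts)  # first suspect is within ~2 m of a truth point; second matches none
```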
# Append results to dataframe (assign as plain lists so values align by position;
# df_geo1 keeps df3's sparse index, so an index-aligned pd.Series would produce NaNs)
df_geo1["onTarget_0"] = thresholdCount_list
df_geo1["onTarget_1"] = thresholdCount_list2
# Inspect results
df_geo1.head()
# Select Cluster1 Points Not Deemed As Outliers
df_geo1b = df_geo1[df_geo1["Cluster1"] > -1]
# Select Cluster2 Points
df_geo2 = df_geo1[df_geo1["Cluster2"].notnull()]
# Select Cluster2 Points Not Deemed As Outliers
df_geo2b = df_geo2[df_geo2["Cluster2"] > -1]
#len(df_geo1[df_geo1["onTarget_0"] > 0])
len(df_geo1[df_geo1["onTarget_1"] > 0])
# Get breakdown of shots on target Cluster 1
print("Total Count of Cluster 1 (All):")
print(
df_geo1.shape[0]
)
print("Perccentage of Points on Target (" + str(threshold_dist) + "m radius):")
print(
str(
round(len(df_geo1[df_geo1["onTarget_0"] > 0]) / df_geo1.shape[0] * 100
,3
)
) + "%"
)
print("Perccentage of Points on Target (" + str(threshold_dist2) + "m radius):")
print(
str(
round(len(df_geo1[df_geo1["onTarget_1"] > 0]) / df_geo1.shape[0] * 100
,3
)
) + "%"
)
print("\nTotal Count of Cluster 1 (Outliers Removed):")
print(
df_geo1b.shape[0]
)
print("Perccentage of Points on Target (" + str(threshold_dist) + "m radius):")
print(
str(
round(len(df_geo1b[df_geo1b["onTarget_0"] > 0]) / df_geo1b.shape[0] * 100
,3
)
) + "%"
)
print("Perccentage of Points on Target (" + str(threshold_dist2) + "m radius):")
print(
str(
round(len(df_geo1b[df_geo1b["onTarget_1"] > 0]) / df_geo1b.shape[0] * 100
,3
)
) + "%"
)
# Get breakdown of shots on target Cluster 2
print("Total Count of Cluster 2 (All):")
print(
df_geo2.shape[0]
)
print("Perccentage of Points on Target (" + str(threshold_dist) + "m radius):")
print(
str(
round(len(df_geo2[df_geo2["onTarget_0"] > 0]) / df_geo2.shape[0] * 100
,3
)
) + "%"
)
print("Perccentage of Points on Target (" + str(threshold_dist2) + "m radius):")
print(
str(
round(len(df_geo2[df_geo2["onTarget_1"] > 0]) / df_geo2.shape[0] * 100
,3
)
) + "%"
)
print("\nTotal Count of Cluster 1 (Outliers Removed):")
print(
df_geo2b.shape[0]
)
print("Perccentage of Points on Target (" + str(threshold_dist) + "m radius):")
print(
str(
round(len(df_geo2b[df_geo2b["onTarget_0"] > 0]) / df_geo2b.shape[0] * 100
,3
)
) + "%"
)
print("Perccentage of Points on Target (" + str(threshold_dist2) + "m radius):")
print(
str(
round(len(df_geo2b[df_geo2b["onTarget_1"] > 0]) / df_geo2b.shape[0] * 100
,3
)
) + "%"
)
This step is to save time, since it takes a very long time to process the data to get to this stage
# Initialise save / load conditions
xlsfile = "../MainData/pothole_records_20200316_bike/AndroSensor_Consolidated/processedPCData_A5f1.xlsx" # a file variant with the LOF cluster data is loaded instead
ws = 'Data'
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# save dataframe as excel file
df_geo1.to_excel(xlsfile, sheet_name=ws)
# Print Execution time
print("%s" % (time.time() - start_time))
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# Load Data
df_geo1 = pd.read_excel(io = xlsfile, sheet_name = ws)
# Print Execution time
print("%s" % (time.time() - start_time))
# Inspect Data
df_geo1.head()
# Drop additional index column created
df_geo1 = df_geo1.loc[:, "SN": "Cluster2"]
# Inspect Data
df_geo1.shape
# Inspect Data
df_geo1.head()
Append geospatial clustering info to main dataset
# Merge data
df3 = pd.merge(df3, df_geo1[['SN', 'Cluster1', 'Cluster2']], on='SN', how='left')
# inspect data
df3.head()
df3.shape
This step is to save time, since it takes a very long time to process the data to get to this stage
# Initialise save / load conditions
xlsfile = "../MainData/pothole_records_20200316_bike/AndroSensor_Consolidated/processedPCData_A5f2.xlsx" # a file variant with the LOF cluster data is loaded instead
ws = 'Data'
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# save dataframe as excel file
df3.to_excel(xlsfile, sheet_name=ws)
# Print Execution time
print("%s" % (time.time() - start_time))
# Measure time taken to run function so we can figure out how long to take a break
start_time = time.time()
# Load Data
df3 = pd.read_excel(io = xlsfile, sheet_name = ws)
# Print Execution time
print("%s" % (time.time() - start_time))
# Inspect Data
df3.head()
# Drop additional index column created
df3 = df3.loc[:, "SN": "Cluster2"]
# Inspect Data
df3.shape
# Inspect Data
df3.head()